Wednesday, March 21, 2012

Linux Threads Through a Magnifier: Remote Threads

Source code for this article may be found here.

Sometimes a need may arise to start a thread in a separate process, and the need is not necessarily malicious. For example, one may want to replace library functions or to place some code between the executable and a library function. However, Linux does not provide a system call similar to the CreateRemoteThread Windows API, even though I see people searching for such functionality. You may google for "CreateRemoteThread equivalent in Linux" yourself and see that at least 90% of the results end up with something like "why would you want to do that?" There is a certain type of people in forums who, most likely, think that if they do not have an answer, then it probably does not exist and no one would ever need it. Others truly believe that if they know why, they can tell you how to do it in another way. The latter is sometimes true, but most of the time the solution being requested is the only acceptable one, and that is what people refuse to understand.

So, let's say you need to inject a thread into a running process for whatever reason (maybe you want to perform a "DLL injection" the Linux way - your business). Although there is no specific system call for that, there are plenty of other system calls and library functions that would "happily" assist you.

Unavoidable ptrace()
The first time you take a look at ptrace(), it is a bit frightening (just like ioctl()) - one function, lots of possible requests, and go figure out which parameter is being ignored and when. In practice, it is quite simple. This function is used by debuggers and in cases when one needs to monitor the execution of a process for whatever reason. We will use this function for thread injection in this article.

The first thing you would want to do is to attach to the target process:

   ptrace(PTRACE_ATTACH, pid, NULL, NULL);

PTRACE_ATTACH - request to attach to a running process;
pid - the ID of the process you want to attach to.

If the return value is 0 - voila, you are attached (the target is sent a SIGSTOP, so use waitpid() to wait until it actually stops). If it is -1, however, an error has occurred and you need to check errno to know what has happened. You should keep in mind that on certain systems you may not be able to attach to a process which is not a descendant of the attaching one or has not specified it as tracer (using prctl()). For example, this has been exactly the situation in Ubuntu since Ubuntu 10.10. If you want to change that, you need to locate your ptrace.conf file (/etc/sysctl.d/10-ptrace.conf on Ubuntu) and set kernel.yama.ptrace_scope to 0.
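The attach step with its error handling can be sketched roughly as follows. The helper name attach_and_wait is made up for this illustration; the only assumption is a Linux system with the usual headers.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper (not from the article's sources): attach to pid and
 * wait until it has stopped. Returns 0 on success and -1 on failure. */
int attach_and_wait(pid_t pid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        /* EPERM usually means the ptrace scope forbids the attach,
         * ESRCH means there is no such process */
        fprintf(stderr, "PTRACE_ATTACH: %s\n", strerror(errno));
        return -1;
    }

    /* The target receives SIGSTOP; wait for it to actually stop */
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status)) {
        return -1;
    }
    return 0;
}
```

Calling attach_and_wait() with the pid of a process you are not allowed to trace is a quick way to see which errno your system reports.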

Since I am using Ubuntu, I can only attach to a child process (unless I want some additional headache), and this is what I am going to cover in this article.

The first step, just like in the case of Windows, is to write an injector. It will load the victim process, inject the shellcode and exit. This is the simplest part and the skeleton of such a loader would look like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>

int   main(int argc, char** argv)
{
   pid_t   pid;
   int     status;

   if(0 == (pid = fork()))
   {
      // We are in the child process, so we just ptrace() and execl()
      ptrace(PTRACE_TRACEME, 0, NULL, NULL);
      // argv[0] of the victim is its own path; the list ends with NULL
      execl(*(argv+1), *(argv+1), (char*)NULL);
      // execl() only returns on failure
      exit(1);
   }
   else
   {
      // We are in the parent (injector)
      // Wait for exec in the child
      waitpid(pid, &status, 0);
      // The rest of the code comes here
   }

   return 0;
}

As you can see, the loader forks and then behaves depending on the return value of the fork() function. If it returns 0, this means that we are in the child process (actually, you should check whether it returned -1, which would indicate an error), otherwise, it is a pid of the child process and we are in the parent.

The child code does not have too many things to do. All that needs to be done is to tell the OS that it may be traced and replace itself with the victim executable by calling execl().

For the parent, the situation is different and much more complicated. You should tell the OS that you want to be notified when the victim process issues sys_execve by calling ptrace() with PTRACE_SETOPTIONS and PTRACE_O_TRACEEXEC. Then you simply waitpid().

When waitpid() returns (and you should check the return value for -1, which means an error), it is still not the best time to start the injection, especially given that you may have no idea of what is where in the victim process. The next step is to wait for a system call to occur (and it would be good to skip a couple of system calls, so that the victim may initialize properly) by telling the OS:

   ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

followed by a loop:

   while(1)
   {
      if(-1 == waitpid(pid, &status, 0))
         break;   // Some error occurred. Print a message here, then break
      if(WIFEXITED(status))
         break;   // The victim process has terminated. Print a message here, then break
      if(WIFSIGNALED(status))
         break;   // The victim process received a signal and terminated. Print a message here, then break
      if(WIFSTOPPED(status))
      {
         // Here comes the actual injection code. Actually, all its stages.
      }
   }

   // All done.
   return 0;

You should introduce a variable to count the stages. Let's name it step.

Stage 0 (step = 0)
I have not mentioned it yet, but ptrace() notifies you twice during a system call. The first notification arrives right before the system call (so you can inspect the registers), the second one arrives right after the system call's completion (so you can inspect the return value). Therefore, this time we do nothing but resume the traced victim:

   ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

and increment the stage variable.

Stage 1 (step = 1)
Back up the victim's registers and the portion of the victim's code that would be overwritten with your shellcode and, finally, inject your shellcode.

Use ptrace(PTRACE_GETREGS, pid, NULL, regs), where regs is a pointer to struct user_regs_struct (declared in sys/user.h). The content of the victim's registers would be copied there.

Use ptrace(PTRACE_PEEKTEXT, pid, address_in_victim, NULL) to copy the executable code from the victim (to make a backup) and ptrace(PTRACE_POKETEXT, pid, address_in_victim, shellcode) to write your code, where address_in_victim is what its name suggests (you obtain the initial value from the victim's RIP on 64 bit or EIP on 32 bit systems). The shellcode argument, however, contains bytes of the code being injected packed into an unsigned long value. Each call transfers one such word (8 bytes on 64 bit systems), so you would most probably have to make those calls in a loop, as I do not think your shellcode would fit into 8 bytes.
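The word-by-word transfer can be sketched like this. The helper names (words_needed, pack_shellcode, poke_words) are invented for the illustration, and padding the last word with NOP bytes (0x90) is my assumption, not something taken from the article's sources. Padding with the victim's original bytes would work just as well, since the backup has to cover the same number of words anyway.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Number of machine words needed to hold len bytes of shellcode */
size_t words_needed(size_t len)
{
    return (len + sizeof(unsigned long) - 1) / sizeof(unsigned long);
}

/* Pack shellcode bytes into words, padding the tail with NOPs (0x90).
 * Returns the number of words used, or 0 if out is too small. */
size_t pack_shellcode(const unsigned char *code, size_t len,
                      unsigned long *out, size_t max_words)
{
    size_t n = words_needed(len);

    if (n > max_words)
        return 0;
    memset(out, 0x90, n * sizeof(unsigned long));
    memcpy(out, code, len);
    return n;
}

/* Write n words into the victim starting at addr; -1 on the first failure */
int poke_words(pid_t pid, unsigned long addr,
               const unsigned long *words, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        void *where = (void *)(addr + i * sizeof(unsigned long));

        if (ptrace(PTRACE_POKETEXT, pid, where, (void *)words[i]) == -1)
            return -1;
    }
    return 0;
}
```

The backup is the same loop with PTRACE_PEEKTEXT. Because -1 is also a valid word value there, errno has to be cleared before each call and checked afterwards.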

The start of your shellcode will allocate memory for the thread function (unless you are going to run code that already is there).

   mov   rax, 9      ;sys_mmap
   mov   rdi, 0      ;requested address
   mov   rsi, 0x1000 ;one page
   mov   rdx, 7      ;PROT_READ | PROT_WRITE | PROT_EXEC
   mov   r10, 0x22   ;MAP_ANON | MAP_PRIVATE
   mov   r8, -1      ;fd
   mov   r9, 0       ;offset
   syscall           ;execute sys_mmap
   db 0xCC           ;int3 - breakpoint marker for the injector

Increment the stage variable and resume the victim process.


Stage 2 (step = 2)
Ignore all stops until

0xCC == (unsigned char)(ptrace(PTRACE_PEEKTEXT, pid,
      ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), NULL), NULL) & 0xFF)

which would mean that you have reached your breakpoint. Check the victim's rax register for the return value

retval = ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rax), NULL);

and abort if it contains an error code.
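What "contains an error code" means for a raw system call can be sketched as below. Raw Linux system calls report failure by returning -errno, that is a value between -4095 and -1, and anything else is a valid result (for sys_mmap, an address). The function name is_syscall_error is made up for this example.

```c
/* Returns non-zero if retval, as read from the victim's rax after a raw
 * system call, is an error code (-errno, in the range -4095..-1). */
int is_syscall_error(long retval)
{
    return retval < 0 && retval >= -4095;
}
```

With the retval read above, the injector would abort when is_syscall_error(retval) is true. Note that PTRACE_PEEKUSER itself returns -1 on failure, so errno should be set to 0 before the call and checked after it to tell the two cases apart.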

You have to increment the Instruction Pointer (RIP/EIP) before letting the victim resume:

ptrace(PTRACE_POKEUSER, pid, offsetof(struct user, regs.rip),
       ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), NULL) + 1);

Increment the stage counter and resume the victim.


Stage 3 (step = 3)
After allocating memory, your shellcode should copy the thread function there and, actually, create a thread (similar to this).

You should, again, ignore all stops as long as

0xCC != (unsigned char)(ptrace(PTRACE_PEEKTEXT, pid,
      ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), NULL), NULL) & 0xFF)

Once you get to this breakpoint, you know that the thread has been initiated and the injector has done what it was written for.

Now you have to restore the victim to its initial, pre-injection state by restoring the values of the registers:

ptrace(PTRACE_SETREGS, pid, NULL, regs);

and, which is even more important - you have to restore the backed up code by copying back the backed up unsigned longs.
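The restore step might look roughly like this. The function name restore_victim and its parameters are hypothetical; backup is assumed to hold the n words saved with PTRACE_PEEKTEXT before the injection and addr the address they were read from.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Hypothetical helper: put the saved code words and registers back.
 * Returns 0 on success and -1 as soon as any ptrace() request fails. */
int restore_victim(pid_t pid, struct user_regs_struct *regs,
                   unsigned long addr, const unsigned long *backup, size_t n)
{
    /* Copy the original code back, one word at a time */
    for (size_t i = 0; i < n; i++) {
        void *where = (void *)(addr + i * sizeof(unsigned long));

        if (ptrace(PTRACE_POKETEXT, pid, where, (void *)backup[i]) == -1)
            return -1;
    }

    /* Restore the registers saved with PTRACE_GETREGS */
    if (ptrace(PTRACE_SETREGS, pid, NULL, regs) == -1)
        return -1;
    return 0;
}
```

The code goes back first in this sketch so that the saved instruction pointer never points into half-restored bytes. Since the victim is stopped the whole time, the order is a matter of taste.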

The last thing would be detaching from the victim process:

   ptrace(PTRACE_DETACH, pid, NULL, NULL);

At this point, your injector may safely exit, letting the victim continue execution.

Voila! You have just injected a thread into another process.

Output of the injector, victim program and the injected thread

P.S. Shared Object Injection (a la DLL injection)
Although injection of executable code is quite simple, injection of a shared object is a different story. Despite the fact that the Linux kernel provides the sys_uselib system call, it may be unavailable on some systems. You have several options:

  • Check whether the victim uses libdl (the dlopen(), dlsym() and dlclose() functions), parse the image and obtain the addresses of the relevant functions. However, not every program uses libdl.
  • Use sys_uselib system call. However, it may be unavailable.
  • Write your own shared object loader. This may be a real pain, but you would be able to reuse it whenever you need.

Hope this post was helpful. See you at the next.

Saturday, March 17, 2012

Linux Threads Through a Magnifier: Local Threads

Source code for this article is here.

Threads are everywhere. Even now, when you browse this page, threads are involved in the process. Most likely, you have more than one tab opened in the browser and each one has at least one thread associated with it. The server supplying this page runs several threads in order to serve multiple connections simultaneously. There are countless examples of threads, but let us concentrate on one specific implementation thereof, namely the Linux implementation of threads.

It is hard to believe that earlier Linux kernels did not support threads. Instead, all the "threading" was performed entirely in user space by a pthread (POSIX thread) library chosen for the specific program. This reminds me of my attempt to implement multitasking in DOS when I was in college - possible, but a real headache.

Modern kernels, on the contrary, have full support for threads, which, from the kernel's point of view, are so-called "Light-weight Processes". They are usually organized in thread groups, which, in turn, represent processes as we know them. As a matter of fact, the getpid libc function (and the sys_getpid system call) returns the identifier of a thread group.
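The difference between a thread ID and a thread group ID can be observed from C with a small sketch. The function names are made up for the example; the thread ID is obtained through the raw sys_gettid system call because older C libraries do not provide a wrapper for it.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread ID of the calling thread, obtained with a raw system call */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* In the initial thread of a process the thread ID equals the thread
 * group ID, which is what getpid() returns. In any other thread of the
 * same process getpid() stays the same while the thread ID differs. */
int is_group_leader(void)
{
    return current_tid() == getpid();
}
```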

Let me reiterate - the best explanation is an explanation by example. In this article, I am going to cover the process of thread creation on 64 bit Linux running on PC using FASM (flat assembler).

Clone, Fork, Exec...
There are several system calls involved in process manipulation. The best known one is sys_fork. This system call "splits" a running process in two - parent and child. While they both continue execution from the instruction immediately following the sys_fork invocation, they have different PIDs (process IDs) or, as we now know, different TGIDs (thread group IDs), and each one gets a different return value from sys_fork. The return value is the child's TGID for the parent process and 0 for the child. In case of error, fork returns -1 and sets errno appropriately, while sys_fork returns a negative error code.
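The three possible outcomes of fork can be illustrated with a short C sketch. The function name fork_and_collect is invented for the example; it lets the child exit with a given code and returns that code as seen by the parent.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, let the child exit with code, and return the exit code collected
 * by the parent. Returns -1 if fork() or waitpid() fails. */
int fork_and_collect(int code)
{
    int   status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;            /* error: no child was created, errno is set */
    if (pid == 0)
        _exit(code);          /* child: fork() returned 0 */

    /* parent: fork() returned the ID of the child */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```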

Exec does not return at all. Well, it formally has a return type of int, but getting a return value means, that the function failed. Exec* libc function or sys_execve system call are used in order to launch a new process. For example, if your application has to start another application, but you do not want or cannot, for any reason, execute system() function, then your application has to fork and the child process calls exec, thus, being replaced in memory by the new process. The execution of the new process starts normally from its entry point.

Clone - this is the function we are interested in. Clone is a libc wrapper for sys_clone Linux system call and is declared in the sched.h header as follows:

int clone(int (*fn)(void*), void *child_stack, int flags, void *arg, ...);

I encourage you to read the man page for the clone libc function ("man clone") :-)

We are not going to deal with clone function here. There are lots of good resources on the internet which provide good examples for it. Instead, we are going to examine the sys_clone Linux system call.

First of all, let us take a look at the definition of the sys_clone in arch/x86/kernel/process.c:

long sys_clone(unsigned long clone_flags, unsigned long newsp,
               void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)

Although the definition looks quite complicated, in reality it only needs clone_flags and newsp to be specified.

But there is a strange thing - it does not take a pointer to the thread function as a parameter. That is normal - sys_clone only performs the action suggested by its name - clones the process. But how about libc's clone? - you may ask. As I have mentioned above, libc's clone is a wrapper, and what it does in addition to calling sys_clone is setting its return address in the cloned process to the address of the thread function. But let us examine it in more detail.

clone_flags - this value tells the kernel how we want our process to be cloned. In our case, as we want to create a thread rather than a separate process, we should use the following or'ed values:

CLONE_VM (0x100) - tells the kernel to let the original process and the clone run in the same memory space;
CLONE_FS (0x200) - both get the same file system information;
CLONE_FILES (0x400) - share file descriptors;
CLONE_SIGHAND (0x800) - both processes share the same signal handlers;
CLONE_THREAD (0x10000) - this tells the kernel, that both processes would belong to the same thread group (be threads within the same process);

SIGCHLD (0x11) - this is not a flag, this is the number of the SIGCHLD signal, which would be sent to the original process (thread) when the thread is terminated (used by wait functions).
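For reference, the same combination can be composed in C from the constants in the kernel headers. This is only a sketch: the function name thread_clone_flags is made up, and it assumes the Linux kernel headers (linux/sched.h) are installed.

```c
#include <linux/sched.h>   /* CLONE_* constants */
#include <signal.h>        /* SIGCHLD */

/* The flag set listed above: share memory, file system information,
 * file descriptors and signal handlers, stay in the same thread group.
 * The low byte carries the signal number rather than a flag. */
unsigned long thread_clone_flags(void)
{
    return CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
           CLONE_THREAD | SIGCHLD;
}
```

The flag part of the result is 0x10F00, which is simply the sum of the hexadecimal values given in the list.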

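newsp - the value of the stack pointer for the cloned process (new thread). This value may be NULL, in which case the clone starts on the same stack addresses as the original. When the memory is not shared (no CLONE_VM, as with fork), this is harmless: if the clone attempts to write to the stack, then, due to the copy-on-write mechanism, it gets new memory pages, thus leaving the stack of the original untouched. With CLONE_VM, however, both threads really share the same memory, so a new thread needs a stack of its own.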

Stack Allocation
Since in most cases you would want to allocate a new stack for a new thread, I cannot leave this aspect uncovered in this article. To make things easier, let us implement a small function which would receive the size of the requested stack in bytes and return a pointer to the allocated memory region.

Important note:
As Linux follows the System V AMD64 ABI when running in 64 bits, function parameters and system call arguments are passed via the following registers:
Function call: arguments 1 - 6 via RDI, RSI, RDX, RCX, R8, R9; additional arguments are passed on stack.
System call: the call number goes to RAX, arguments 1 - 6 via RDI, RSI, RDX, R10, R8, R9; a system call takes at most six arguments and the syscall instruction itself overwrites RCX and R11.

C declaration:
void* map_stack(unsigned long stack_size);

PROT_READ     = 1
PROT_WRITE    = 2
MAP_PRIVATE   = 0x002
MAP_ANON      = 0x020
MAP_GROWSDOWN = 0x100
SYS_MMAP      = 9

map_stack:
   push  rdi rsi rdx r10 r8 r9                 ;Save registers
   mov   rsi, rdi                              ;Requested size
   xor   rdi, rdi                              ;Preferred address (may be NULL)
   mov   rdx, PROT_READ or PROT_WRITE          ;Memory protection
   mov   r10, MAP_PRIVATE or MAP_ANON or MAP_GROWSDOWN ;Allocation attributes
   xor   r8, r8                                ;File descriptor (-1)
   dec   r8
   xor   r9, r9                                ;Offset - irrelevant, so 0
   mov   rax, SYS_MMAP                         ;Set system call number
   syscall                                     ;Execute system call
   pop   r9 r8 r10 rdx rsi rdi                 ;Restore registers
   ret                                         ;Result (address or error code) is in RAX

Calling this function would be as easy as:

mov  rdi, size
call map_stack

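This function returns either a negative error code as provided by sys_mmap or the address of the allocated memory region. Keep in mind that mmap always returns the lowest address of the new mapping, MAP_GROWSDOWN or not. What the MAP_GROWSDOWN attribute adds is that the kernel extends the mapping downward when the page just below it is touched, which is why using the returned address as a new stack pointer can work here. The more conventional choice is the returned address plus the size of the region.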
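A C counterpart of map_stack may make the address arithmetic easier to follow. This is a sketch under the assumption, taken from the mmap man page, that the call returns the lowest address of the mapping; the names map_stack_c and stack_top are made up. Unlike the raw system call, the libc wrapper returns MAP_FAILED and sets errno instead of returning a negative error code.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate a region for a thread stack. Returns the lowest address of the
 * mapping or MAP_FAILED. */
void *map_stack_c(size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
}

/* Conventional initial stack pointer: the address right past the highest
 * byte of the region, because the stack grows toward lower addresses. */
void *stack_top(void *base, size_t size)
{
    return (char *)base + size;
}
```

A caller would pass stack_top(base, size) as the new stack pointer and keep base around for a later munmap().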

Creation of Thread
In this section, we will implement a trivial create_thread function. It would allocate a stack (of default size = 0x1000 bytes) for a new thread, invoke sys_clone and return either to the instruction following call create_thread or to the thread function, depending on the return value of sys_clone.

C declaration:
long create_thread(void(*thread_func)(void*), void* param);

As you may see, the return type of the thread_func is void, unlike the real clone function. I will show you why a bit later.

create_thread:
   mov   r14, rdi    ;Save the address of the thread_func
   mov   r15, rsi    ;Save thread parameter
   mov   rdi, 0x1000 ;Requested stack size
   call  map_stack   ;Allocate stack
   mov   rsi, rax    ;Set newsp
   mov   rdi, CLONE_VM or CLONE_FS or CLONE_THREAD or CLONE_SIGHAND or SIGCHLD ;Set clone_flags
   xor   rdx, rdx    ;parent_tid (3rd system call argument)
   xor   r10, r10    ;child_tid (4th system call argument)
   xor   r8, r8      ;5th system call argument, not used here
   mov   rax, SYS_CLONE ;sys_clone is call number 56 on 64 bit systems
   syscall           ;Execute system call
   or    rax, 0      ;Check sys_clone return value
   jnz   .parent     ;If not 0, then it is the ID of the new thread
   push  r14         ;Otherwise, set new return address (thread_func)
   mov   rdi, r15    ;Set argument for the thread_func
   ret               ;Return to thread_func
.parent:
   ret               ;Return to parent (main thread)

Exiting Thread
Everyone who has ever searched the Web for an Assembly programming tutorial for Linux is familiar with the sys_exit system call. On the 64 bit Intel platform it is call number 60. However, they all (the tutorials) miss the point. Although sys_exit works perfectly with single threaded hello-world-like applications, the situation is different with multithreaded ones. In general, sys_exit terminates a thread, not a process, which is definitely enough in case of a process with a single thread, but may lead to strange artifacts (or even zombies) if, for example, a thread continues to print to stdout after you have terminated the main thread.

Now, the promised explanation of the thread_func return type. In our case (as in most cases) the thread_func does not return by means of the ret instruction. It just can't, as there is no return address on the stack, and even if you put one there, returning would not terminate the thread. Instead, you should implement something like this exit_thread function.

C declaration:
void exit_thread(long result);

exit_thread:
                         ; Result is already in RDI
   mov   rax, SYS_EXIT   ; Set system call number (60)
   syscall               ; Execute system call

Exiting Process
By exiting process we usually mean total termination of the running process. Linux gracefully provides us with a system call which terminates a group of threads (process) - sys_exit_group (call number 231). The function for terminating the process is as simple as this:

C declaration:
void exit_process(long result);

exit_process:
                             ; Result is already in RDI
   mov   rax, SYS_EXIT_GROUP ; Set system call number (231)
   syscall                   ; Execute system call

Attached Source Code
The source code attached to this article (which may be found here) contains a trivial example of the application that creates thread with the method described above. In addition, it contains the list of system call numbers for both 32 and 64 bit platforms.

Note for Nerds:
The attached code is for demonstration purpose only and may not contain such important elements as checking for errors, etc.

32 bit Systems
If you decide to convert the code given above to run on 32 bit systems, that would be quite easy. First of all - change register names to the appropriate 32 bit ones.

Second - remember how parameters are passed to system calls in 32 bit kernels. They are still passed through registers, but the registers are different. Parameters 1 through 6 are passed through EBX, ECX, EDX, ESI, EDI and EBP. The system call number is placed, as usual, in EAX, and the same register is used to store the return value upon the system call's completion. Keep in mind that the call numbers themselves differ between 32 and 64 bit kernels.

Third - use int 0x80 instead of the syscall instruction.

Fourth - remember to change function prologues due to a different calling convention. While 64 bit systems use the AMD64 ABI, 32 bit systems use cdecl, passing arguments on the stack by default.

Hope this article was interesting and helpful.

See you at the next (remote threads in Linux - stay tuned).

Tuesday, March 6, 2012

Faking KERNEL32.DLL - an Amateur Sandbox

As a part of my work (read "fun") of maintaining this blog, I constantly check the statistics on traffic sources and keywords (it's nice to know that people are getting here via Google) in order to see whether my readers are getting what they are looking for (personally, I see no reason to simply "stream my consciousness to the masses" as this is not the point of this blog). Sometimes, this gives an idea of what is missing but still related to system and low level programming.

A couple of days ago, I saw that someone was looking for a way to load and use a fake KERNEL32.dll and I realized that this information has not yet been covered here. There is no source code for this article as I am a bit short on time to write it, but I will do my best to provide as much information as possible so that those who want to try it would have no problem doing that.

The first notable thing about KERNEL32.dll is that it is always loaded, regardless of whether a running executable imports anything from it (this is briefly covered here). The same goes for NTDLL.dll (well, KERNEL32.dll imports from it). This library provides the running executable, and some of the other dynamic link libraries loaded into the process' memory, with interfaces for interaction with deeper levels of the "user land" part of the operating system.

Knowing all that, the first thought may be: "how are we going to fake it if all the rest depends on it?". The solution is easier than one could think at first. However, we should keep in mind that some programs may import from NTDLL.dll directly, bypassing KERNEL32.dll (which used to happen quite often in the world of malware), meaning that once you have faked KERNEL32.dll, you may have to fake NTDLL.dll as well.

We should start with writing a good old simple DLL/code injector. It is easier to dissect the victim process from inside. This is the simplest part and it is covered in this and this post of this blog. Roughly speaking, the injector should be able to create a victim process in a suspended state by passing the CREATE_SUSPENDED process creation flag to the CreateProcess API.

Writing the code or the DLL we are going to inject is a harder task as this code is intended to perform the tasks described below in order of execution.

Load Fake KERNEL32.dll
Let's assume that we already have a ready to use fake KERNEL32.dll (we'll get back to the creation of the fake DLL a bit later). This is quite simple - call the LoadLibrary function from your code. One thing worth mentioning is that MSDN advises against calling LoadLibrary from your DllMain function. Therefore, if you decide to use DLL injection instead of code injection, then better use the approach described in the "Advanced DLL Injection" article.

The fake KERNEL32.dll should simply import all APIs from the original one. Don't be mistaken - import, not forward its exports, at least as long as we are talking about API functions, but you may safely forward exported objects and variables to the original one.

Resolve Victim's Imports
By the time we get our code/DLL running inside the suspended victim process, all of its imports should already have been resolved. What we still have to do is to replace all API addresses exported by the original KERNEL32.dll with the corresponding addresses in our fake one.

Here is a link to Microsoft's specifications of the MS PE and MS COFF file formats - it would be useful while digging through imports and exports.
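Stripped of the PE parsing, the replacement itself boils down to walking a table of function pointers. The following is a simplified, platform-neutral model, not real Windows code: the names api_ptr and patch_import_slots are invented for the illustration. In a real process the slots would be the import address table entries (the FirstThunk array of the import descriptor for KERNEL32.dll), and the page holding them may first have to be made writable with VirtualProtect.

```c
#include <stddef.h>

typedef void (*api_ptr)(void);

/* Replace every slot that holds the address of the original API with the
 * address of its fake counterpart. Returns the number of patched slots. */
size_t patch_import_slots(api_ptr *slots, size_t count,
                          api_ptr original, api_ptr replacement)
{
    size_t patched = 0;

    for (size_t i = 0; i < count; i++) {
        if (slots[i] == original) {
            slots[i] = replacement;
            patched++;
        }
    }
    return patched;
}
```

The same routine would be called once per API, with the original address taken from the real KERNEL32.dll and the replacement taken from the fake one.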

Hide the Original KERNEL32.dll
While performing the aforementioned actions may be enough in case of a regular application, we should take some precautions in case of malicious code. My suggestion is to hide the original KERNEL32.dll by replacing its entry in the list of LDR_MODULE structures in PEB with the one describing our fake KERNEL32.dll, just like we would hide an injected DLL in the "Hiding Injected DLL in Windows" article.

Creation of Fake KERNEL32.dll
This may sound scary, but there is no need to worry (at least not too much). All that we need in order to create one, is a C compiler (or whatever high level language you prefer) and any assembler (I use FASM as usual).

Dump KERNEL32.dll to ASM Source
No, of course we do not have to disassemble the whole DLL and dump it to a corresponding Assembly source. Instead, what we have to do, is write a small application in high level language (you may try to do it in Assembly if you want) that would parse the export table of the original KERNEL32.dll and create a set of Assembly source files: one for code, one for data (if needed), one for import and one for export sections.

Like it or not, the application has to generate a bit of Assembly code, at least for transferring the execution flow to an API function in the original KERNEL32.dll. For example, if we have no interest in, let's say, ExitProcess, then our fake ExitProcess should look similar to this:

fake_ExitProcess:
   ; As we are not tracing/logging this function, we simply let the
   ; original ExitProcess shoot
   jmp dword [real_ExitProcess]

However, the code would be different for APIs of interest. For example, the CreateFileA API would be implemented like this:

fake_CreateFileA:
   ; We pass control to a fake CreateFileA, which is implemented in
   ; a separate DLL imported by our fake KERNEL32.dll
   ; Parameters are already on the stack, so we simply jump.
   ; Don't forget to declare the fake function as STDCALL
   ; (remember #define WINAPI __stdcall ? )
   jmp dword [our_CreateFileA]
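The part of the generator that emits such stubs could look roughly like this. It is only a sketch: the function name format_stub is made up, and the naming scheme (fake_ for the stub, real_ and our_ for the imported addresses) simply follows the examples above.

```c
#include <stdio.h>
#include <string.h>

/* Write the Assembly stub for one exported API into buf.
 * traced != 0: jump to our own implementation of the API,
 * traced == 0: jump straight to the original one.
 * Returns the length of the generated text or -1 on bad input. */
int format_stub(char *buf, size_t size, const char *api, int traced)
{
    if (api == NULL || strlen(api) == 0)
        return -1;
    return snprintf(buf, size, "fake_%s:\n   jmp dword [%s_%s]\n",
                    api, traced ? "our" : "real", api);
}
```

The real tool would call something like this for every name found in the export table of the original KERNEL32.dll and write the result to the code source file.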

The Assembly source file containing code for the import section would then contain the following:

section '.idata' import data readable writable
   library original, 'kernel32.dll',\            ;Import original KERNEL32.dll
           fake,     'our_dll_with_fake_api.dll' ;Import a DLL with fake APIs

   import original,\
      real_ExitProcess, 'ExitProcess'

   import fake,\
      our_CreateFileA, 'whatever you call it here'

Now, finally, we get to the export section's code:

section '.edata' export data readable
   export 'KERNEL32.dll',\              ;FASM does not care about what you type here, 
                                        ;so let's be fake to the end and pretend 
                                        ;to be KERNEL32.dll
      fake_ExitProcess, 'ExitProcess',\
      fake_CreateFileA, 'CreateFileA'

Finally the main source file, the one that would assemble all the rest together:

format PE DLL at 0x10000000

include 'code.asm'
include 'import.asm'
include 'export.asm'

section '.reloc' fixups data discardable

Compile it with FASM and you have your fake KERNEL32.dll.

Implementation of Fake API
As it has been mentioned above, there are some functions we would want to trace. Those should have some custom implementation, preferably in a separate DLL (which would be loaded by Windows loader at the time it resolves our fake KERNEL32.dll's dependencies). Below is a diagram of the interactions between all the modules:
Interactions between modules involved in faking.

And here is an example of such fake API:

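HANDLE WINAPI fake_CreateFileA(
                               LPCSTR lpFileName,
                               DWORD dwDesiredAccess,
                               DWORD dwShareMode,
                               LPSECURITY_ATTRIBUTES lpSA,
                               DWORD dwCreationDisposition,
                               DWORD dwFlagsAndAttributes,
                               HANDLE hTemplateFile)
{
   // Log the call; list whatever parameters you are interested in
   fprintf(log_file, "CreateFileA(\"%s\", ...)\n", lpFileName);
   // Pass the request on to the original API
   return CreateFileA(lpFileName, dwDesiredAccess, dwShareMode, lpSA,
                      dwCreationDisposition, dwFlagsAndAttributes,
                      hTemplateFile);
}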

Of course, you may implement additional mechanisms within this DLL, e.g. let it communicate with another application via sockets or pipes, but this deserves a separate article.

My personal suggestion is to insert more code into each function inside the fake KERNEL32.dll so that it would look more realistic to the victim process (should it try to do anything with it).

Hope this article was useful. 

See you at the next.

Sunday, March 4, 2012

Trivial Artificial Neural Network in Assembly Language

Source code for this article may be found here.

Note for nerds: The code shown in this article may be incomplete and may not contain all the security checks you would usually perform in your code as it is given here for demonstration purposes only. Downloadable source code may contain bugs (there is no software without bugs at all). It is provided as is without any warranty. You may use and redistribute it as you wish while mentioning this site and the author.

I was recently digging through my sources and came across a small ANN (artificial neural network) library I wrote several months ago in 64 bit Intel Assembly language (FASM syntax) and decided to share it with my respected readers hoping that it may be useful in some cases.

Artificial Neural Network
The Internet is full of articles covering this topic either in general or in depth. Personally, I would prefer not to create yet another clone with pictures of synapses, etc. In short, an ANN is a computational model inspired by the way our brain seems to work. There is a good article on Wikipedia giving quite a good amount of explanation. It seems important to mention that when saying "ANN", people usually think of a Perceptron or a Multilayer Perceptron, but there are many more types out there. You should check out this article if you are interested.

However, this article covers the implementation of a Multilayer Perceptron in Assembly language, which appears to be easier than it sounds. The library is appropriate for the creation of multilayer perceptrons with any number of hidden layers and any number of input and output neurons. Although it is bound to 64 bit Linux, I will try to explain how you can change the code to make it compatible with 64 bit Windows. It would take much more effort, however, to actually rewrite the whole thing to run on 32 bit platforms.

The neuron is the basis of the whole project and the main part of the calculation. In this example, all neurons are arranged into a linked list, having input neurons at the beginning of the list and output neurons at its end. It is important to mention that they would all be processed in the same order they appear in the linked list. First of all, let's define a structure that would contain all the information we need for a single neuron:

struc list
{
   .prev_ptr    dq ?
   .next_ptr    dq ?
}

struc neuron
{
   .list        list  ;Pointers to previous and next neurons
   .input       dq ?  ;Pointer to the first input synapse
   .output      dq ?  ;Pointer to the first output synapse
   .value       dq ?  ;Resulting value of the neuron
   .signal      dq ?  ;Error signal
   .sum         dq ?  ;Sum of all weighted inputs
   .bias        dq ?  ;Bias weight (threshold)
   .bias_delta  dq ?  ;Bias weight delta
   .index       dw ?  ;Index of the given neuron
   .num_inputs  dw ?  ;Number of input synapses
   .num_outputs dw ?  ;Number of output synapses
   .type        dw ?  ;Type of the neuron (bit field)
   .size        = $ - .list
}
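For readers who think in C, the same layout could be declared roughly as follows. This is a sketch, not part of the attached sources: it assumes a 64 bit platform and that the 8 byte numeric fields (value, signal, sum, bias, bias_delta) hold double precision floating point values. The name neuron_t matches the C prototype used for the processing function; list_t is made up.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct list {
    struct list *prev_ptr;   /* previous element */
    struct list *next_ptr;   /* next element */
} list_t;

typedef struct neuron {
    list_t   list;           /* pointers to previous and next neurons */
    void    *input;          /* pointer to the first input synapse */
    void    *output;         /* pointer to the first output synapse */
    double   value;          /* resulting value of the neuron */
    double   signal;         /* error signal */
    double   sum;            /* sum of all weighted inputs */
    double   bias;           /* bias weight (threshold) */
    double   bias_delta;     /* bias weight delta */
    uint16_t index;          /* index of the given neuron */
    uint16_t num_inputs;     /* number of input synapses */
    uint16_t num_outputs;    /* number of output synapses */
    uint16_t type;           /* type of the neuron (bit field) */
} neuron_t;
```

Under these assumptions sizeof(neuron_t) is 80 bytes, the same value the Assembly definition computes for neuron.size.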

Figure 1 shows the arrangement of neurons in a perceptron designed to perform XOR operation. It has 2 input neurons, three neurons in the hidden layer and two output neurons. Arrows show the order of processing.

Figure 1
I implemented this perceptron with two output neurons for testing purposes only; it could just as well be implemented with a single output neuron, where an output value above 0.5 would mean 1 and anything below would mean 0.

There would be no perceptron without synaptic links. This is where the following structure appears on the scene.

struc synaps
{
   .inputs        list   ;Pointers to previous and next input synapses
                         ;if such exist
   .outputs       list   ;Pointers to previous and next output synapses
                         ;if such exist
   .value         dq ?   ;Value to be transmitted
   .weight        dq ?   ;Weight of the synapse
   .delta         dq ?   ;Weight delta
   .signal        dq ?   ;Error signal
   .input_index   dw ?   ;Index of the input neuron in the list of neurons
   .output_index  dw ?   ;Index of the output neuron in the list of neurons
                  dd ?   ;Alignment
   .size          = $ - .inputs
}

At first, it may be a bit hard to understand why there are so many pointers in both structures. Unfortunately, my verbal abilities are far from perfect (especially given that English is not my mother tongue), so let me first illustrate the way neurons are interconnected with synapses in this implementation (in the hope that my graphic abilities are no worse than my verbal ones).
Figure 2

Figure 2 shows that each neuron (except the output ones) has a pointer (neuron.output) to a list of synapses that need to be fed with this neuron's calculated value. For a neuron, its output synapses are linked to each other with synaps.outputs pointers. In turn, each neuron (except the input ones) has a pointer (neuron.input) to a list of synapses to collect inputs from; those are linked to each other with synaps.inputs pointers. In the figure, each gray arrow goes from a neuron in the left layer to a neuron in the right layer through the corresponding synaptic link.

Processing a Single Neuron
Each neuron in the network is processed by the same function, whose C prototype looks like this:

void neuron_process(neuron_t* n, int activation_type);

where n is a pointer to the neuron we want to process and activation_type specifies which activation function should be used. This implementation only has one activation function, the logistic (aka exponential) one:

f(x) = 1.0 / (1.0 + exp(-2 * x))

The following piece of code is an Assembly implementation of EXP():

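;double exp(double d)
   push   rbp
   mov    rbp, rsp
   sub    rsp, 8
   push   rbx rcx rdx rdi rsi
   movsd  qword [rbp-8], xmm0     ;pass the argument to the FPU via memory
   fld    qword [rbp-8]           ;st0 = d
   fldl2e                         ;st0 = log2(e), st1 = d
   fmulp  st1, st0                ;st0 = d * log2(e)
   fld    st0
   frndint                        ;st0 = integer part, st1 = d * log2(e)
   fsub   st1, st0                ;st1 = fractional part
   fxch   st1                     ;st0 = fraction, st1 = integer
   f2xm1                          ;st0 = 2^fraction - 1
   fld1
   faddp  st1, st0                ;st0 = 2^fraction
   fscale                         ;st0 = 2^fraction * 2^integer = e^d
   fstp   st1                     ;drop the integer part
   fstp   qword [rbp-8]
   movsd  xmm0, qword [rbp-8]     ;return the result in xmm0
   pop    rsi rdi rdx rcx rbx
   add    rsp, 8
   pop    rbp
   ret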
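
If the FPU stack juggling is hard to follow, here is a rough C sketch of the same idea: e^x is rewritten as 2^(x * log2(e)), the exponent is split into an integer part and a fraction, and the two powers of two are multiplied together. The names exp_sketch and logistic_sketch are made up for this illustration, and the Taylor series merely stands in for a "2^fraction" FPU instruction, so treat it as an explanation rather than as the library's code.

```c
#include <assert.h>

/* e^x = 2^(x * log2(e)) = 2^n * 2^f, where n is an integer and |f| <= 0.5 */
double exp_sketch(double x)
{
    const double LOG2_E = 1.4426950408889634;   /* log2(e) */
    const double LN_2   = 0.6931471805599453;   /* ln(2)   */

    /* Keep 2^n comfortably inside the range of a double. */
    assert(x > -700.0 && x < 700.0);

    double y = x * LOG2_E;
    long   n = (long)(y + (y >= 0.0 ? 0.5 : -0.5));  /* nearest integer */
    double f = y - (double)n;                        /* fraction        */

    /* 2^f = e^(f * ln 2); the argument is small, so a short Taylor
       series converges to double precision. */
    double t = f * LN_2, term = 1.0, sum = 1.0;
    for (int k = 1; k <= 20; ++k) {
        term *= t / k;
        sum  += term;
    }

    /* Scale by 2^n, one factor of two at a time. */
    for (long i = 0; i < n; ++i) sum *= 2.0;
    for (long i = 0; i > n; --i) sum *= 0.5;
    return sum;
}

/* The activation function from the text: f(x) = 1 / (1 + exp(-2x)). */
double logistic_sketch(double x)
{
    return 1.0 / (1.0 + exp_sketch(-2.0 * x));
}
```

With this in place, logistic_sketch(0.0) gives exactly 0.5, which is the midpoint used later to decide between 0 and 1.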

Now for x itself: x is the sum of the products of value and weight over all input synaptic links, plus the bias weight of the neuron. The result of f() is then stored to each and every output synaptic link (unless this is an output neuron), in accordance with the diagram shown in Figure 3:

Figure 3
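
In C terms, the processing of one neuron described above could be sketched like this. The structures are stripped down to the fields the step actually touches, all names ending in _sketch are hypothetical, and the activation is passed as a function pointer (the library itself takes an integer activation_type). I am also assuming that input neurons have their value set from outside and that the lists are NULL-terminated.

```c
#include <assert.h>
#include <stddef.h>

typedef struct syn_sketch {
    struct syn_sketch *next_input;   /* next synapse feeding the same neuron */
    struct syn_sketch *next_output;  /* next synapse fed by the same neuron  */
    double value;                    /* value to be transmitted              */
    double weight;
} syn_sketch_t;

typedef struct {
    syn_sketch_t *input;             /* first input synapse (NULL for inputs)   */
    syn_sketch_t *output;            /* first output synapse (NULL for outputs) */
    double bias, sum, value;
} nrn_sketch_t;

/* Placeholder activation so the sketch needs no math library;
   the logistic function from the article would go here instead. */
double linear_activation(double x) { return x; }

void process_sketch(nrn_sketch_t *n, double (*activate)(double))
{
    assert(n != NULL && activate != NULL);

    if (n->input != NULL) {
        /* x = bias + sum of value * weight over all input synapses */
        double x = n->bias;
        for (syn_sketch_t *s = n->input; s != NULL; s = s->next_input)
            x += s->value * s->weight;
        n->sum   = x;
        n->value = activate(x);
    }

    /* Hand the result to every output synapse (none for output neurons). */
    for (syn_sketch_t *s = n->output; s != NULL; s = s->next_output)
        s->value = n->value;
}
```

Because the neurons sit in the list in processing order, calling such a function for each neuron from the head of the list to its tail is enough to run the whole net.
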
The Net
We are almost done building our net. Let's define a structure that incorporates all the information about our perceptron and all the values needed for training and execution:

struc net
{
   .neurons      dq ? ;Pointer to the first neuron in the linked list of neurons
   .outs         dq ? ;Pointer to the first output neuron
   .num_neurons  dd ? ;Total number of neurons
   .activation   dd ? ;Which activation method to use (we only have one here)
   .qerror       dq ? ;Mean quadratic error
   .num_inputs   dw ? ;Number of input neurons
   .num_outputs  dw ? ;Number of output neurons
   .rate         dq ? ;Learning rate regulates learning speed
   .momentum     dq ? ;Roughly speaking - error tolerance
   .size         = $ - .neurons
}

The Fun
The source code attached to this article implements all the functions needed to manipulate the network. All functions are exported by the library and described in "ann.h". However, we only need to deal with a few of them:

net_t* net_alloc(void);
This function allocates a net_t object and returns a pointer to it.

void net_fill(net_t* net, int number_of_neurons, int number_of_inputs, int number_of_outputs);
This function populates the net with the requested number of neurons and sets all values and pointers accordingly.

void net_set_links(net_t* net, int* links);
This function is responsible for setting up all the synaptic links between neurons. While net is a pointer to a previously allocated net_t structure, links is a pointer to an array of integer pairs terminated by a pair of zeros:

int pairs[][2]={
   {1, 3},
   {1, 4},
   {2, 4},
   {2, 5},
   {3, 6},
   {3, 7},
   {4, 6},
   {4, 7},
   {5, 6},
   {5, 7},
   {0, 0}};

The above array is exactly the one used in the supplied test application in order to set up the links as shown in Figure 3.
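
To make the format of the links array a bit more tangible, here is a small sketch that walks such an array the way a consumer presumably would: pair by pair, until the {0, 0} terminator. The helper names are invented for this example and are not part of the library. Since a zero pair ends the list, neuron indices evidently start at 1.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the {from, to} pairs that precede the {0, 0} terminator. */
size_t count_links_sketch(const int *links)
{
    assert(links != NULL);
    size_t count = 0;
    while (links[2 * count] != 0 || links[2 * count + 1] != 0)
        ++count;
    return count;
}

/* Counts the links that end at the given neuron, i.e. its input synapses. */
size_t fan_in_sketch(const int *links, int neuron)
{
    size_t total = count_links_sketch(links);
    size_t fan_in = 0;
    for (size_t i = 0; i < total; ++i)
        if (links[2 * i + 1] == neuron)
            ++fan_in;
    return fan_in;
}
```

For the array above, count_links_sketch(&pairs[0][0]) yields 10 links, and each of the two output neurons (6 and 7) has three inputs, one from every hidden neuron.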

double net_train(net_t* net, double* values, double* targets);
This function is responsible for everything that is needed in order to train our perceptron using the back-propagation training paradigm. It returns the mean quadratic error of all output neurons (which is also accessible through net->qerror).

values - array of values to be fed into the net prior to running it (the function does not check whether the length of the array is appropriate, so your code is responsible for that);
targets - array of expected results.
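
As an illustration of what a "mean quadratic error" over the output neurons means, here is a minimal sketch. It assumes the plain definition (the average of the squared differences between targets and outputs); the library may scale the value differently, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Average of (target - output)^2 over all output neurons.
   Assumed definition; the library's exact scaling is not shown in the text. */
double mean_sq_error_sketch(const double *outputs, const double *targets,
                            size_t count)
{
    assert(outputs != NULL && targets != NULL && count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double diff = targets[i] - outputs[i];
        sum += diff * diff;
    }
    return sum / (double)count;
}
```

Whatever the exact scaling, the value shrinks towards zero as the outputs approach the targets, which is what makes it usable as a stop condition for training.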

Because we are using the logistic activation function, it is necessary to normalize the input data to the range 0 < x < 1 (outputs will be in the same range as well).
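
One simple way to do such normalization is min-max scaling with a small margin, so that the result never touches 0 or 1 exactly. The sketch below is just one possible approach; the function name and the 0.05 margin are arbitrary choices of mine, and whether the library tolerates exact 0 and 1 is not something I am claiming either way.

```c
#include <assert.h>

/* Maps v from [lo, hi] into [0.05, 0.95], i.e. strictly inside (0, 1). */
double normalize_sketch(double v, double lo, double hi)
{
    const double margin = 0.05;          /* arbitrary safety margin */
    assert(hi > lo && v >= lo && v <= hi);
    double unit = (v - lo) / (hi - lo);  /* 0..1 */
    return margin + unit * (1.0 - 2.0 * margin);
}
```

For example, a raw value of 5 in the range 0..10 ends up at about 0.5, while the extremes 0 and 10 map to about 0.05 and 0.95.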

Run this function as many times as needed in order to get a reasonably small error. You will have to "play" with the rate and momentum parameters to find the best values, but you may start with 0.9 for rate and 0.02 for momentum. It is important to set those values, as the library does not check whether they are set or not!
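
To give a feeling for how these two parameters usually interact, here is the textbook weight update with momentum. This is the classic back-propagation rule and only an assumption about what the library does internally; the structure and function names are hypothetical, and gradient_term stands for the product of the error signal and the transmitted value.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double weight;  /* current weight                   */
    double delta;   /* weight change from the last step */
} weight_sketch_t;

/* Textbook update: new delta = rate * gradient + momentum * old delta. */
void update_weight_sketch(weight_sketch_t *w, double gradient_term,
                          double rate, double momentum)
{
    assert(w != NULL);
    w->delta   = rate * gradient_term + momentum * w->delta;
    w->weight += w->delta;
}
```

In this form, rate decides how far a single training step moves the weight, while momentum carries over a fraction of the previous step.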

void net_run(net_t* net, double* values);
This function runs the net.

values - same as for the net_train function;

This function does not return a value, so you have to read the results manually through net->outs.
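
Reading the results boils down to walking the linked list from the first output neuron to the end of the list. The sketch below uses a stripped-down node with only a next pointer and a value; the type and function names are hypothetical, and I am assuming that the list is NULL-terminated and that the output neurons are the last ones in it, as described earlier.

```c
#include <assert.h>
#include <stddef.h>

typedef struct out_sketch {
    struct out_sketch *next;   /* next neuron in the list (NULL at the end) */
    double             value;  /* resulting value of the neuron             */
} out_sketch_t;

/* Converts output values to bits using the 0.5 threshold; returns how many
   outputs were read (at most max_bits). */
size_t read_outputs_sketch(const out_sketch_t *first, int *bits, size_t max_bits)
{
    assert(bits != NULL);
    size_t n = 0;
    for (const out_sketch_t *o = first; o != NULL && n < max_bits; o = o->next)
        bits[n++] = (o->value > 0.5) ? 1 : 0;
    return n;
}
```

With the real library, the starting point would be whatever net->outs points to, and the walk would follow the neuron's next pointer instead of the simplified one used here.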

Attached Source Code
The attached source code may be compiled with flat assembler:

   fasm libann.asm libann.o

and linked to any C code compiled with GCC on 64-bit Linux.

Making the Code Windows Compatible
This requires a fair bit of work. First of all, you would have to edit the libann.asm file, changing format ELF64 to format MS64 COFF and adjusting the sections' attributes accordingly. You would also have to make some changes to the code itself. 64-bit Linux uses the System V AMD64 ABI, while Microsoft has its own calling convention. The major differences are in how parameters are passed to functions. On Linux, integers are passed in RDI, RSI, RDX, RCX, R8 and R9 and doubles in XMM0 - XMM7, with all the rest going on the stack in reverse order. Microsoft uses RCX, RDX, R8 and R9 for integers and XMM0 - XMM3 for doubles, and any additional parameters are passed on the stack in reverse order.

The output for the XOR problem should look similar to this:


Thanks for reading! I hope this article was interesting and maybe even helpful.

See you at the next one!