Showing posts with label information. Show all posts

Sunday, March 4, 2012

Trivial Artificial Neural Network in Assembly Language

Source code for this article may be found here.

Note for nerds: The code shown in this article may be incomplete and may not contain all the security checks you would usually perform in your code as it is given here for demonstration purposes only. Downloadable source code may contain bugs (there is no software without bugs at all). It is provided as is without any warranty. You may use and redistribute it as you wish while mentioning this site and the author.

I was recently digging through my sources and came across a small ANN (artificial neural network) library I wrote several months ago in 64 bit Intel Assembly language (FASM syntax) and decided to share it with my respected readers hoping that it may be useful in some cases.

Artificial Neural Network
The Internet is full of articles covering this topic either in general or in depth. Personally, I would prefer not to create yet another clone with pictures of synapses, etc. In short, an ANN is a computational model inspired by the way our brain seems to work. There is a good article on Wikipedia giving quite a good amount of explanation. It is important to mention that when saying "ANN", people usually think of a Perceptron or a Multilayer Perceptron, but there are many more types out there. You should check out this article if you are interested.

However, this article covers the implementation of a Multilayer Perceptron in Assembly language, which appears to be easier than it sounds. The library is appropriate for the creation of multilayer perceptrons with any number of hidden layers and any number of input and output neurons, although it is bound to 64 bit Linux. I will try to explain how you can change the code to make it compatible with 64 bit Windows, but it would take much more effort to actually rewrite the whole thing to run on 32 bit platforms.

Neuron
This is the basis of the whole project. Neuron is the main part of the calculation. In this example, all neurons are arranged into a linked list, having input neurons at the beginning of the list and output neurons at its end. It is important to mention that they would all be processed in the same order they appear in the linked list. First of all, let's define a structure, that would contain all the information we need for a single neuron:

struc list
{
   .prev_ptr    dq ?
   .next_ptr    dq ?
}

struc neuron
{
   .list        list  ;Pointers to previous and next neurons
   .input       dq ?  ;Pointer to the first input synapse
   .output      dq ?  ;Pointer to the first output synapse
   .value       dq ?  ;Resulting value of the neuron
   .signal      dq ?  ;Error signal
   .sum         dq ?  ;Sum of all weighted inputs
   .bias        dq ?  ;Bias weight (threshold)
   .bias_delta  dq ?  ;Bias weight delta
   .index       dw ?  ;Index of the given neuron
   .num_inputs  dw ?  ;Number of input synapses
   .num_outputs dw ?  ;Number of output synapses
   .type        dw ?  ;Type of the neuron (bit field)
   .size        = $ - .list
}

Figure 1 shows the arrangement of neurons in a perceptron designed to perform XOR operation. It has 2 input neurons, three neurons in the hidden layer and two output neurons. Arrows show the order of processing.

Figure 1
I implemented this perceptron with 2 output neurons for testing purposes only, as it could well be implemented with a single output neuron, where output value > 0.5 would be 1 and below would be 0.

Synapse
There would be no perceptron without synaptic links. This is where the following structure appears on the scene.

struc synaps
{
   .inputs        list   ;Pointers to previous and next input synapses
                         ;if such exist
   .outputs       list   ;Pointers to previous and next output synapses
                         ;if such exist
   .value         dq ?   ;Value to be transmitted
   .weight        dq ?   ;Weight of the synapse
   .delta         dq ?   ;Weight delta
   .signal        dq ?   ;Error signal
   .input_index   dw ?   ;Index of the input neuron in the list of neurons
   .output_index  dw ?   ;Index of the output neuron in the list of neurons
                  dd ?   ;Alignment
   .size          = $ - .inputs
}


At first, it may be a bit hard to understand why there are so many pointers in both structures. Unfortunately, my verbal abilities are far from being perfect (especially given that English is not my mother tongue), therefore, let me first illustrate the way neurons are interconnected with synapses in this implementation (in the hope that my graphic abilities are not worse than the verbal ones).
Figure 2

Figure 2 shows that each neuron (except the output ones) has a pointer (neuron.output) to a list of synapses that need to be fed with this neuron's calculated value. For a neuron, its output synapses are linked with synaps.outputs pointers. In turn, each neuron (except the input ones) has a pointer (neuron.input) to a list of synapses to collect inputs from. On the figure, each gray arrow goes from a neuron in the left layer to a neuron in the right layer through the corresponding synaptic link.

Processing a Single Neuron
Each neuron in the network is processed with the same function, whose prototype in C looks like this:

void neuron_process(neuron_t* n, int activation_type);

where n is a pointer to the neuron we want to process and activation_type specifies which activation function should be used. As I have mentioned above, this implementation only has one activation function - logistic (aka exponential):

f(x) = 1.0 / (1.0 + exp(-2 * x))

The following piece of code is an Assembly implementation of EXP():

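;double exp(double d)
;Computes e^d as 2^(d * log2(e)) using the x87 FPU
exp:
   push   rbp
   mov    rbp, rsp
   sub    rsp, 8                 ;Scratch qword at [rbp-8]
   push   rbx rcx rdx rdi rsi
   movsd  qword [rbp-8], xmm0
   fld    qword [rbp-8]          ;st0 = d
   fldl2e                        ;st0 = log2(e), st1 = d
   fmulp  st1, st0               ;st0 = y = d * log2(e)
   fld    st0
   frndint                       ;st0 = n = round(y), st1 = y
   fsub   st1, st0               ;st1 = f = y - n
   fxch   st1                    ;st0 = f, st1 = n
   f2xm1                         ;st0 = 2^f - 1
   fld1
   faddp  st1, st0               ;st0 = 2^f
   fscale                        ;st0 = 2^f * 2^n
   fstp   st1                    ;Drop n, keep the result
   fstp   qword [rbp-8]
   fwait
   movsd  xmm0, qword [rbp-8]
   pop    rsi rdi rdx rcx rbx
   add    rsp, 8
   leave
   ret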

Now the x itself. x is the sum of the products of value and weight of all input synaptic links plus the bias weight of the neuron. The result of f() is then stored to each and every output synaptic link (unless it is an output neuron) in accordance with the diagram shown in figure 3:

Figure 3
The Net
We are almost done with building our net. Let's define a structure that would incorporate all the information about our perceptron and all values needed for training and execution:

struc net
{
   .neurons     dq ? ;Pointer to the first neuron in the linked list of neurons
   .outs         dq ? ;Pointer to the first output neuron
   .num_neurons  dd ? ;Total amount of neurons
   .activation   dd ? ;Which activation method to use (we only have one here)
   .qerror       dq ? ;Mean quadratic error
   .num_inputs   dw ? ;Number of input neurons
   .num_outputs  dw ? ;Number of output neurons
   .rate         dq ? ;Learning rate regulates learning speed
   .momentum     dq ? ;Momentum - share of the previous weight delta added to the new one
   .size         = $ - .neurons
}
   

The Fun
The source code attached to this article implements all the functions needed to manipulate the network. All functions are exported by the library and described in "ann.h". However, we only need to deal with a few of them:

net_t* net_alloc(void);
This function allocates a net_t object and returns a pointer to it.

void net_fill(net_t* net, int number_of_neurons, int number_of_inputs, int number_of_outputs);
This function populates the net with the requested number of neurons and sets all values and pointers accordingly.

void net_set_links(net_t* net, int* links);
This function is responsible for setting up all the synaptic links between neurons. While net is a pointer to previously allocated net_t structure, links is a pointer to the array of integer pairs terminated by a pair of 0's:

int pairs[][2]={
   {1, 3},
   {1, 4},
   {2, 4},
   {2, 5},
   {3, 6},
   {3, 7},
   {4, 6},
   {4, 7},
   {5, 6},
   {5, 7},
   {0, 0}};

The above array is exactly the one used in the supplied test application in order to set up links as shown on figure 3.
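
Since {0, 0} terminates the array, neuron indices in the pairs evidently start at 1. A minimal sketch of how such an array may be walked is shown below; count_links is a hypothetical helper written for this illustration and is not part of the library.

```c
/* Hypothetical helper: counts link pairs up to (not including)
   the terminating {0, 0} pair */
int count_links(const int* links)
{
   int count = 0;
   while (links[0] != 0 || links[1] != 0)
   {
      count++;
      links += 2;   /* advance to the next {from, to} pair */
   }
   return count;
}
```

For a two-dimensional array like the one above, the pointer to pass is &pairs[0][0].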

double net_train(net_t* net, double* values, double* targets);
This function is responsible for everything that is needed in order to train our perceptron using the back-propagation training paradigm. Returns mean quadratic error of all output neurons (which is also accessible through net->qerror).

values - array of values to be fed into the net prior to running it (the function does not check whether the length of the array is appropriate, so your code is responsible for that);
targets - array of expected results.

Due to the fact that we are using logistic activation function, it is necessary to normalize the input data to be in the range [0 < x < 1] (outputs would be in the same range as well).
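
One possible way to normalize the data is a simple linear rescaling. The helper below is a sketch under the assumption that the minimum and maximum of your raw data are known in advance; the name rescale is made up for this example.

```c
/* Hypothetical helper: maps v from [min, max] into [lo, hi];
   assumes max > min */
double rescale(double v, double min, double max, double lo, double hi)
{
   return lo + (v - min) * (hi - lo) / (max - min);
}
```

Calling it as rescale(v, min, max, 0.1, 0.9) keeps every value strictly inside the open range, so the targets stay reachable for the logistic function, which never outputs exactly 0 or 1.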

Run this function as many times as needed in order to get a reasonable error. You will have to "play" with rate and momentum parameters to get best values, but you may start with 0.9 for rate and 0.02 for momentum. It is important to specify those values as the library does not check whether they are set or not!
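
To give a feeling of what rate and momentum do, here is the textbook weight update used in back-propagation with momentum. It is an assumption that the library follows this exact rule, as its training code is not shown in this article; link_t and update_weight are hypothetical names used only in this sketch.

```c
/* Textbook weight update with momentum (assumed, not taken from libann):
   delta  = rate * signal * value + momentum * previous_delta
   weight = weight + delta */
typedef struct
{
   double weight;
   double delta;   /* delta applied on the previous step */
} link_t;

void update_weight(link_t* l, double signal, double value,
                   double rate, double momentum)
{
   l->delta = rate * signal * value + momentum * l->delta;
   l->weight += l->delta;
}
```

A higher rate makes each step larger, while momentum carries a share of the previous step over to the next one, which smooths the path the weights take.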

void net_run(net_t* net, double* values);
This function is used in order to run the net. 

values - same as in case of net_train function;

This function does not return a value, so you have to manually access net->outs.


Attached Source Code
The attached source code may be compiled with flat assembler:

   fasm libann.asm libann.o

and linked to any C code compiled with GCC on 64 bit Linux.

Making the Code Windows Compatible
This requires some work. First of all, you would have to edit the libann.asm file, changing format ELF64 to format MS64 COFF and the sections' attributes accordingly. You would also have to make some changes to the code, as 64 bit Linux uses the AMD64 ABI, while Microsoft has its own. The major differences are in how parameters are passed to functions. In Linux, integers are passed via the RDI, RSI, RDX, RCX, R8 and R9 registers (all the rest on the stack in reverse order) and doubles via XMM0 - XMM7, while Microsoft uses RCX, RDX, R8 and R9 for integers and XMM0 - XMM3 for doubles, with any additional parameters passed on the stack in reverse order.

Output
The output for XOR problem should look similar to this:

Output

Thanks for reading! Hope this article was interesting and may be even helpful.

See you at the next!



Friday, March 2, 2012

Defeating Packers for Static Analysis of Malicious Code

I doubt whether there is anybody in either AV industry or among reverse engineers who does not know what a software packer is (for those who don't - this article may help). Malware research and reverse engineering forums are full of packers' related questions, descriptions thereof, unpacking suggestions and links to both packers and unpackers. In short - people have been doing a lot of precious work on defeating packers and protectors.

However, for those of us who are not afraid of static analysis, there is an easier way (I'd dare to say "generic") to handle packers and protectors and retrieve the unpacked form of the executable (cannot stop myself from adding a note for nerds: no, this does not include reversing virtual machines like Oreans' one. This is up to you). So, the main problem is obtaining the unpacked version of the code, as all the rest may be well handled from there. What we actually need is a dump of the unpacked executable. There are lots of memory dumping programs, but some protectors "know" how to handle them, therefore, this article explains a simple and short way of obtaining such a dump without teasing the implemented protections.

Knock Knock
First of all, we need to, let's say, get into the process. There are at least two ways to do that:

  • Use the OpenProcess Windows API with, preferably, PROCESS_ALL_ACCESS and read/write from/to process' memory.
  • Inject our code into the process' memory space (simply a code injection or a DLL injection).
My preference is the second one as you mostly have more power operating from inside.

There are several ways to inject a DLL into another process, e.g. calling LoadLibraryA as a remote thread in the victim process or even this one (my preference is the second one again). This is indeed the easiest part. My personal suggestion would be to create a suspended process, inject your DLL and then resume the main thread of the created process, as this provides you with greater flexibility.

Set the Trap
No, I am not referring to 0xCC (trap to debugger). Trap, in this case, means something that would trigger the dumper embedded into the injected DLL and cause it to dump the unpacked image. For example, you may patch one of the API functions with a jmp instruction, which would redirect the execution flow to where we want. Be careful with this approach, as your patch may well be "overpatched" by the protection mechanism of the target executable. Let me give you a couple of suggestions: never patch the first bytes of the API (I assume this code is not production code, so you may let it be bound to your version of Windows); patch as deep as possible, meaning leave kernel32.dll alone and go further to ntdll.dll where possible.

For example, if your target executable outputs a string to a console, it may be a good idea to patch either the WriteConsoleA or the WriteConsoleW API function. However, it may be an even better idea to go deeper and patch WriteConsoleInternal (Win7) and install your notification jump there. Once that API is called, chances are that the executable has been fully unpacked. As an alternative, you may simply create a new thread in your DLL, Sleep it for several milliseconds (or even seconds) and then dump the memory.

You may perform these actions in the DllMain of your injected DLL, on the other hand, you may create a separate procedure for this, but then you'd have to use this approach or something similar.

Dumper Triggered
No matter how the dumper is triggered (either by the API patch or by our thread), we are certainly not going to dump the whole memory allocated by the process. We just need the image. The easiest way of getting the ImageBase and SizeOfImage of the target module (usually the main module of the process) is to find the corresponding entry in PEB (you may want to check the "Hiding Injected DLL in Windows" post to get more information on PEB and related structures). However, you HAVE to be careful with that, as the information in PEB may be altered by the protection scheme of the victim executable. Having found the base address and the size of the image, just write the content of that memory region to a file (make sure to take note of the image's base address if you are dumping a DLL). Quite simple, isn't it? Well, not really. You have to check the memory protection of every region you are going to dump, as it may be PAGE_NOACCESS or PAGE_EXECUTE only, meaning that you cannot access it for reading. Once done with this, you may either let the program continue or terminate the process.
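
The protection check may be sketched as a small helper like the one below. The protection of a region can be obtained with the VirtualQuery API, which fills a MEMORY_BASIC_INFORMATION structure (see its Protect field). The constants are those defined in the Windows SDK; the function name is_readable is made up for this example.

```c
/* Page protection constants as defined in the Windows SDK (winnt.h);
   guarded so the sketch also compiles where windows.h is included */
#ifndef PAGE_NOACCESS
#define PAGE_NOACCESS          0x01
#define PAGE_READONLY          0x02
#define PAGE_READWRITE         0x04
#define PAGE_WRITECOPY         0x08
#define PAGE_EXECUTE           0x10
#define PAGE_EXECUTE_READ      0x20
#define PAGE_EXECUTE_READWRITE 0x40
#define PAGE_EXECUTE_WRITECOPY 0x80
#define PAGE_GUARD             0x100
#endif

/* Hypothetical helper: returns 1 if a region with the given protection
   (MEMORY_BASIC_INFORMATION.Protect) can be read without faulting */
int is_readable(unsigned long protect)
{
   if (protect & PAGE_GUARD)
      return 0;   /* first access raises a guard page exception */
   switch (protect & 0xFF)
   {
      case PAGE_READONLY:
      case PAGE_READWRITE:
      case PAGE_WRITECOPY:
      case PAGE_EXECUTE_READ:
      case PAGE_EXECUTE_READWRITE:
      case PAGE_EXECUTE_WRITECOPY:
         return 1;
      default:    /* PAGE_NOACCESS, PAGE_EXECUTE or unknown */
         return 0;
   }
}
```

For regions that are not readable, one option is to change the protection temporarily with VirtualProtect, dump the region and then restore the original value.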

In addition, it is strongly recommended to suspend all the threads of the process, except the thread our code is running in.

Using Dump
Nothing's easier - load the dump into IDA Pro and see how well it handles it.

P.S. Suspending/Resuming Threads
Suspending threads is a bit annoying, as you have to get the IDs of all the threads currently running in the system, then select those with the process ID of your process and suspend them. The same procedure is applicable for resuming suspended threads.

First of all CreateToolhelp32Snapshot (MSDN):

HANDLE WINAPI CreateToolhelp32Snapshot(
              DWORD dwFlags,
              DWORD th32ProcessID
       );

You have to specify TH32CS_SNAPTHREAD as flags in order to get thread information. If the return value is not INVALID_HANDLE_VALUE, then you may proceed with Thread32First:

BOOL WINAPI Thread32First(
            HANDLE          hSnapshot,
            LPTHREADENTRY32 lpte
            );

followed by subsequent calls to Thread32Next (has the same arguments) until the return value is FALSE.

The functions Thread32First and Thread32Next fill the THREADENTRY32 structure which has the following format:

typedef struct tagTHREADENTRY32
{
   DWORD dwSize; /* Should be set to sizeof(THREADENTRY32) prior 
                    to calling Thread32First */
   DWORD cntUsage;
   DWORD th32ThreadID;
   DWORD th32OwnerProcessID;
   LONG  tpBasePri;
   LONG  tpDeltaPri;
   DWORD dwFlags;
} THREADENTRY32, *PTHREADENTRY32;

Fields of interest for you would be the th32OwnerProcessID and the th32ThreadID. Compare the th32OwnerProcessID with the ID of the process your code is running in (previously obtained with GetCurrentProcessId()). If those values are equal and th32ThreadID is not the ID of your own thread (GetCurrentThreadId()), then you have to open the thread with:

HANDLE WINAPI OpenThread(
              DWORD dwDesiredAccess, /* Would be THREAD_ALL_ACCESS */
              BOOL  bInheritHandle, /* FALSE */
              DWORD dwThreadId      /* th32ThreadID */
              );

Then suspend the thread with:

DWORD WINAPI SuspendThread( HANDLE hThread );

while passing the handle obtained with OpenThread().
You have to resume threads once you have saved the dump by calling:

DWORD WINAPI ResumeThread( HANDLE hThread );

Don't forget to close each thread handle with CloseHandle.


That's it. Hope this post was helpful (at least I used this method a lot).
See you at the next.


Monday, December 19, 2011

Listing Loaded Shared Objects in Linux

I have recently come across several posts on the Internet where guys keep asking for Linux analogs of Windows API. One of the most frequent ones is something like "EnumProcessModules for Linux". As usual, most of the replies look like "why do you need that?" or "Linux is not Windows". Although the last one is totally true, it is completely useless. As to "why do you need that?" - why do you care? Poor guy's asking a question here, so let's assume he knows what he's doing.

I remember looking for something like this myself while working on some virtualization project for one of my previous employers. One thing I've learnt - once the question is out of the ordinary (and people do not usually ask for Windows API replacements in Linux), there is a really good chance of getting tons of useless replies and being blamed for being unclear. More than that, when it comes to Linux, most people do not really understand the difference between doing something in the shell and doing something in your program (as, unfortunately, many call shell scripts programs as well).

Well, enough crying here. Let's get to business. As usual, a note for nerds (non nerds are welcome to comment, leave suggestions, etc.)

  • the code in this article may not contain all necessary checks for invalid values;
  • yes, there are other ways of doing this;
  • you are going to mess with libc here, so be careful;
What are Modules (in this case)
In Linux the word "module" has a different meaning from what you've been used to in Windows. While in Windows this word means components of a process (main executable and all loaded DLLs), in Linux it refers to a part of the kernel (usually a driver). If this is what you mean, then you probably want to enumerate loaded kernel modules and this is beyond the scope of this article. What we are going to do here is write to the terminal the paths of all loaded shared objects (the Linux analog of a Windows DLL), and we are going to do it in a less common way just to see how things are organized internally. Just like we have the LDR_MODULE structure in Windows, we have the link_map structure in Linux. In both cases these structures describe loaded libraries (well, in Windows there's also an LDR_MODULE for the main executable).

link_map Structure
We do not need to know too much about this structure (for those interested - see "include/link.h" in your glibc sources). We may even define our own structure for that (a minimal one):

struct lmap
{
   void*    base_address;   /* Base address of the shared object */
   char*    path;           /* Absolute file name (path) of the shared object */
   void*    not_needed1;    /* Pointer to the dynamic section of the shared object */
   struct lmap *next, *prev;/* chain of loaded objects */
};

There is some more information in the original structure, but we do not need it for now.

Getting There
So we know what the link_map structure looks like and that is good. But how can we get there? Let me assume that you are aware of dynamic linking. In Linux we have dl* functions:

dlopen - loads a shared object (LoadLibrary);
dlclose - unloads a shared object (FreeLibrary);
dlsym - gets the address of a symbol from the shared object (GetProcAddress).

The dlopen function returns a pseudo handle to the loaded shared object. These functions are declared in "dlfcn.h". You also have to explicitly link the dl library by passing -ldl to gcc.

While in Windows a module handle (HMODULE) is equal to the base address of the module, in Linux the pseudo handle is in fact a pointer to the corresponding link_map structure. This means that getting to the head of the list of loaded modules is quite easy:

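struct lmap* get_list_head(void* handle)
{
   struct lmap* retval = (struct lmap*)handle;
   /* Stop at the head of the chain (no previous entry)
      or before an entry without a path */
   while(NULL != retval->prev && NULL != retval->prev->path)
      retval = retval->prev;
   return retval;
}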

Things are a bit more complicated if you do not intend to load any shared object. You will still have to use the dl library, though.

First of all, you will have to call dlopen, despite the fact that you are not going to load anything. Call it with NULL passed as the first argument and RTLD_NOW as the second. The return value in this case would be the pseudo handle for the main executable (similar to GetModuleHandle(NULL) in Windows), but it would point to a different structure (to be honest, I've been too lazy to dig for it in libc sources) than link_map. This structure contains different pointers and we are particularly interested in the fourth one. This pointer points to a structure (which I was too lazy to dig for as well) with some other pointers/values and we are particularly interested (again) in the fourth one. This pointer, in turn, finally gets us to the first link_map structure. In my case, it is a structure which refers to libdl.so.2. Let's take a look at the procedure in C:

struct something
{
   void*  pointers[3];
   struct something* ptr;
};

struct lmap* pl;
void* ph = dlopen(NULL, RTLD_NOW);
struct something* p = (struct something*)ph;
p = p->ptr;
pl = (struct lmap*)p->ptr;


List Loaded Objects
Now we are ready to list all loaded objects. Assume p is a pointer to the first link_map (in our case lmap) structure:

while(NULL != p)
{
   printf("%s\n", p->path);
   p = p->next;
}

In my case the output is (about three times less than in a Windows process ;-) ):
/lib32/libdl.so.2
/lib32/libc.so.6
/lib/ld-linux.so.2

That's all. We are done. The mechanism described above may be used in order to either enumerate loaded shared objects or to get their handles. I personally used it for amusement.
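
For comparison, one of the "other ways" mentioned in the note for nerds is the dl_iterate_phdr function provided by glibc, which walks the same list of loaded objects without touching undocumented structures. The sketch below is not the method described in this article; print_object and list_loaded_objects are names made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* dl_iterate_phdr is a GNU extension */
#endif
#include <link.h>
#include <stdio.h>

/* Called by dl_iterate_phdr once per loaded object;
   data points to a counter supplied by the caller */
static int print_object(struct dl_phdr_info* info, size_t size, void* data)
{
   (void)size;
   /* glibc reports an empty name for the main executable */
   printf("%p %s\n", (void*)info->dlpi_addr,
          info->dlpi_name[0] ? info->dlpi_name : "(main executable)");
   (*(int*)data)++;
   return 0;   /* zero means "keep iterating" */
}

/* Prints every loaded object and returns how many were seen */
int list_loaded_objects(void)
{
   int count = 0;
   dl_iterate_phdr(print_object, &count);
   return count;
}
```

Unlike the list printed above, this one also includes the main executable itself.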

Just remember, that in Linux, unlike Windows, handle to an object is not its base address, but the address of (pointer to) the corresponding link_map structure.


Hope this post was at least interesting (if not helpful). See you at the next!

Friday, December 16, 2011

Executable Code Injection the Interesting Way

So. Executable code injection. In general, this term is associated with malicious intent. It is true in many cases, but in at least as many it is not. Having been a malware researcher for most of my career, I can assure you that this technique appears to be very useful when researching malicious software, as it allows you (in most cases) to defeat its protection and gather much of the needed information. Although it is highly recommended not to use such an approach, sometimes it is simply unavoidable.

There are several ways to perform code injection. Let's take a look at them.

DLL Injection
The simplest way to inject a DLL into another process is to create a remote thread in the context of that process, passing the address of the LoadLibrary API as the ThreadProc. However, it appears to be unreliable in modern versions of Windows due to address randomization (which is not actually the case at the moment, but who knows, one day it may become real randomization).

Another way, a bit more complicated, implies a shell code to be injected into the address space of another process and launched as a remote thread. This method offers more flexibility and is described here.

Manual DLL Mapping
Unfortunately, it has become fashionable to give new fancy names to good old techniques. Manual DLL Mapping is nothing more than a complicated code injection. Why complicated, you may ask - because it involves the implementation of a custom PE loader, which should be able to resolve relocations. Adhering to the Occam's Razor principle, I take the responsibility to claim that it is much easier and makes more sense to simply allocate memory in another process using the VirtualAllocEx API and inject position independent shell code.

Simple Code Injection
As the title of this section states, this is the simplest way. Allocate a couple of memory blocks in the address space of the remote process using VirtualAllocEx (one for code and one for data), copy your shell code and its data into those blocks and launch it as a remote thread.

All the methods listed above are covered well on the Internet. You may just google for "code injection" and you will get thousands of well written tutorials and articles. My intention is to describe a more complex, but also a more interesting way of code injection (in a hope that you have nothing else to do but try to implement this).

Before we start:
Another note for nerds. 
  • The code in this article does not contain any security checks unless it is needed as an example.
  • This is not malware writing tutorial, so I do not care whether the AV alerts when you try to use this method.
  • No, manual DLL mapping is not better ;-).
  • Neither do I care about how stable this solution is. If you decide to implement this, you will be doing it at your own risk.
Now, let's have some fun.




Disk vs Memory Layout
Before we proceed with the explanation, let's take a look at the PE file layout, whether on disk or in memory, as our solution relies on that.

This layout is logically identical for both PE files on disk and PE files in memory. The only differences are that some parts may not be present in memory and, most important for us, that on disk items are aligned by the "File Alignment" value while in memory they are aligned by the "Section Alignment" value (usually equal to the page size). Both values may be found in the Optional Header. For full PE COFF format reference check here.

Right now, we are particularly interested in sections that contain executable code ((SectionHeader.characteristics & 0x20000020) != 0). Usually, the actual code does not fill the whole section, leaving some parts simply padded by zeros. For example, if our code section only contains 'ExitProcess(0)', which may be compiled into 8 bytes, it will still occupy FileAlignment bytes on disk (usually 0x200 bytes). It will take even more space in memory, as the next section may not be mapped closer than this_section_virtual_address + SectionAlignment (0x1000 in this particular case), which means that if we have 0x1F8 free bytes when the file is on disk, we'll have 0xFF8 free bytes when the file is loaded in memory.
The "formula" to calculate available space in code section is next_section_virtual_address - (this_section_virtual_address + this_section_virtual_size) as virtual size is (usually) the amount of actual data in section. Remember this, as that is the space that we are going to use as our injection target.
It may happen, that the target executable does not have enough spare space for our shell code, but let this not bother you too much. A process contains more than one module (the main executable and all the DLLs). This means that you can look for spare space in the code sections of all modules. Why only code sections? Just in order not to mess too much with memory protection.
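
The check for code sections and the "formula" for spare space may be sketched in C like this. The structure below only carries the three section header fields needed here (the real IMAGE_SECTION_HEADER has more), and the names section_t, is_code_section and free_space are made up for this example.

```c
#include <stdint.h>

#define SCN_CNT_CODE    0x00000020u  /* IMAGE_SCN_CNT_CODE    */
#define SCN_MEM_EXECUTE 0x20000000u  /* IMAGE_SCN_MEM_EXECUTE */

/* Only the section header fields this sketch needs */
typedef struct
{
   uint32_t virtual_address;
   uint32_t virtual_size;
   uint32_t characteristics;
} section_t;

/* Same test as in the text: section holds code or is executable */
int is_code_section(const section_t* s)
{
   return (s->characteristics & (SCN_CNT_CODE | SCN_MEM_EXECUTE)) != 0;
}

/* Free bytes between the end of this section's data and the next section */
uint32_t free_space(const section_t* this_s, const section_t* next_s)
{
   uint32_t end = this_s->virtual_address + this_s->virtual_size;
   return next_s->virtual_address > end ? next_s->virtual_address - end : 0;
}
```

For the 8 byte example above (section at 0x1000, next section at 0x2000) this gives 0xFF8 bytes. The last section of a module has no successor, so a different upper bound (for example, SizeOfImage) would be needed there.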

Shellcode
The first and the most important rule for shellcode - it MUST be position independent. In our case, this rule is especially unavoidable (if you may say so) as it is going to be spread all over the memory space (depends on the size of your shell code, of course). 

The second, but not less important rule - carefully plan your code according to your needs. The less space it takes, the easier the injection process would be.

Let's keep our shell code simple. All it would do is interception of a single API (does not matter which one, select whichever you want from the target executable's import section), and show a message box each time that API is called (you should probably select ExitProcess for interception if you do not want the message box popping up all the time).

Divide your shellcode into independent functional blocks. By independent, I mean that it should not have any direct or relative calls or jumps. Each block should have one data field, which would contain the address of the table containing addresses of all our functions (and data if needed). Such mechanism would allow us to spread the code all over the available space in different modules without the need to mess with relocations at all.

The picture on the left and the diagram below will help you to better understand the concept. 
Init - our initialization function. Once the code is injected, you would want to call this function as a remote thread.
Patch - this block is responsible for actually patching the import table with the address of our Fake.

The code in each of the above blocks will have to access Data in order to retrieve addresses of functions from other blocks.
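
The idea of blocks that only reach each other through a table may be illustrated in C. Real shellcode would be written in Assembly, so treat this purely as a model of the indirection; the layout of data_block_t and all function names are assumptions made for this sketch.

```c
/* Hypothetical layout of the Data block: a table of addresses that every
   code block reaches through a single pointer instead of a direct call */
typedef struct data_block
{
   int (*patch)(struct data_block* d, int api_index);
   int (*fake)(struct data_block* d);
   int patched_index;   /* stands in for real shared data */
} data_block_t;

/* "Patch" block: records which import was redirected */
static int patch_block(data_block_t* d, int api_index)
{
   d->patched_index = api_index;
   return 1;
}

/* "Fake" block: the replacement for the intercepted API */
static int fake_block(data_block_t* d)
{
   return d->patched_index;
}

/* The injector fills the table once the final addresses are known */
void fill_table(data_block_t* d)
{
   d->patch = patch_block;
   d->fake = fake_block;
   d->patched_index = -1;
}

/* "Init" block: calls the other blocks only through the table */
int init_block(data_block_t* d, int api_index)
{
   if (!d->patch(d, api_index))
      return -1;
   return d->fake(d);
}
```

Because no block refers to another block by address, each of them can be placed wherever spare space is found, and only the table has to be filled in by the injector.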

Your initialization procedure would have to locate KERNEL32.DLL in memory in order to obtain the addresses of the LoadLibrary (yes, it would be better to use LoadLibrary rather than GetModuleHandle), GetProcAddress and VirtualProtect API functions, which are crucial even for such a simple task as patching one API call. Those addresses would be stored in Data.

The Injector
While the shellcode is pretty trivial (at least in this particular case), the injector is not. It will not allocate memory in the address space of another process (as long as that can be avoided, of course). Instead, it will parse the PEB (Process Environment Block) of the victim in order to get the list of loaded modules. Once that is done, it will parse the section headers of every module in order to create a list of available memory locations (remember, we prefer code sections only) and fill the Data block with appropriate addresses. Let's take a look at each step.

First of all, it may be a good idea to suspend the process by calling the SuspendThread function on each of its threads (you may want to read this post about threads enumeration). One more thing to remember is to open the victim process with the following flags: PROCESS_VM_READ | PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION | PROCESS_SUSPEND_RESUME in order to be able to perform all the following operations. The SuspendThread function itself is quite simple:

DWORD WINAPI SuspendThread(__in HANDLE hThread);

Don't forget to resume all threads with ResumeThread once the injection is done.

The next step would be calling the NtQueryInformationProcess function from the ntdll.dll. The only problem with it is that it has no associated import library and you will have to locate it with GetProcAddress(GetModuleHandle("ntdll.dll"), "NtQueryInformationProcess"), unless you have a way to explicitly specify it in the import table of your injector. Also, try LoadLibrary if the GetModuleHandle does not work for you.

NTSTATUS WINAPI NtQueryInformationProcess(
   __in      HANDLE ProcessHandle,
   __in      PROCESSINFOCLASS ProcessInformationClass, /* Use 0 in order to 
                                               get the PEB address */
   __out     PVOID ProcessInformation,  /* Pointer to the PROCESS_BASIC_INFORMATION
                                                       structure */
   __in      ULONG ProcessInformationLength, /* Size of the PROCESS_BASIC_INFORMATION
                                                     structure in bytes */
   __out_opt PULONG ReturnLength
);

typedef struct _PROCESS_BASIC_INFORMATION
{
   PVOID     ExitStatus;
   PPEB      PebBaseAddress;
   PVOID     AffinityMask;
   PVOID     BasePriority;
   ULONG_PTR UniqueProcessId;
   PVOID     InheritedFromUniqueProcessId;
} PROCESS_BASIC_INFORMATION;

NtQueryInformationProcess will provide you with the address of the PEB of the victim process. This post explains how to deal with the PEB content. Of course, you will not be able to access that content directly (as it is in the address space of another process), so you will have to use the WriteProcessMemory and ReadProcessMemory functions for that.

BOOL WINAPI WriteProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPVOID   lpBaseAddress,  /* Address in another process */
   __in   LPCVOID  lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesWritten
);

BOOL WINAPI ReadProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPCVOID  lpBaseAddress, /* Address in another process */
   __out  LPVOID   lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesRead
);

Due to the fact that you are going to deal with read only memory locations, you should call VirtualProtectEx in order to make those locations writable (PAGE_EXECUTE_READWRITE).  Don't forget to restore memory access permissions to PAGE_EXECUTE_READ when you are done. 

BOOL WINAPI VirtualProtectEx(
   __in  HANDLE hProcess,
   __in  LPVOID lpAddress, /* Address in another process */
   __in  SIZE_T dwSize,  /* Size of the range in bytes */
   __in  DWORD  flNewProtect, /* New protection */
   __out PDWORD lpflOldProtect
);

You may also want to change the VirtualSize of those sections of the victim process you used for injection in order to cover the injected code. Just adjust it in the headers in memory.

That's all folks. Let me leave the hardest part (writing the code) up to you this time. 

Hope this post was interesting and see you at the next.