
Showing posts with label obfuscation. Show all posts

Saturday, May 19, 2012

Simple Runtime Framework by Example

Source code for this article may be found here.

These days we are simply surrounded by different software frameworks. Just to name a few: Java, .Net and, actually, many more. Have you ever wondered how those work or have you ever wanted or needed to implement one? In this article, I will cover a simple or even trivial runtime framework.

As usual - note for nerds:
The source code given in this article is for example purposes only. I know that this framework is far from being perfect, therefore, this article is not a howto or tutorial - just an explanation of principle. Error checks are omitted on purpose. You want to implement a real framework - do it yourself, including error checks.

Now, let's get down to business.

Software Framework
Wikipedia gives the following definition of the term "Software Framework" - "A software framework is a universal, reusable software platform used to develop applications, products and solutions. Software Frameworks include support programs, compilers, code libraries, an application programming interface (API) and tool sets that bring together all the different components to enable development of a project or solution". As you can see, a software framework is quite a complex thing. However, let's simplify it and see how it basically works.

Figure 1.
Software Framework
The diagram on the left may give you a good understanding of what a Software Framework is and what role it performs. Simply put, it is a shim between the user application and the Operating System. There are at least two types of Software Frameworks:

  1. Application Programming Interface (API) - if we take a look at the Windows API, we may see that it is a framework as well. However, it may be bypassed or, at least, a programmer may choose to decrease the interaction with it by, for example, using functions from ntdll.dll instead of those provided by kernel32.dll or even "talk" to the Windows kernel directly (highly not recommended, but may be unavoidable sometimes) through interrupts.
  2. .Net like framework - total isolation of user code from the operating system. Such frameworks are mostly virtual machines totally isolating the user application from the operating system and hardware. However, such a framework has to provide the application with all the services available in the Operating System. This is the type of framework we are going to build in this article.




Virtual Machine
The basics of building a simple virtual machine are covered in this article, so I will only give a brief explanation here. Our VM in this example will consist of the following components:
  1. Virtual CPU
    A structure that represents a CPU - basically, has 6 registers and a pointer to the stack:

    typedef struct
    {
       unsigned int  regs[6];
       unsigned int* stack;
    }CPU;

    The 6 registers are the general purpose A, B, C and D (where A is also used to store the system call return value and C is used as a counter for the LOOP instruction), plus the STACK POINTER (SP) and the INSTRUCTION POINTER (IP).
  2. Instruction Interpreter
    A function or a set of functions responsible for interpreting the pseudo assembly (or call it intermediate assembly language) designed for this virtual machine (in this case 14 instructions).
  3. System Call Handler
    This component provides the means for the user application to interact with the Operating System (in this case 2 system calls: sys_write and sys_exit).

Core Function
The name of the function speaks for itself. This is the first function of the framework implementation which gains control. In this particular case, it does not have too many things to do - initialization of the virtual CPU and execution of the command interpreter, until the user application exits (signals the framework to terminate the execution).
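To make the three components and the core loop a bit more tangible, here is a minimal sketch in C. It is not the code attached to this article: the opcode numbers, the Insn record, the function names and the one-character sys_write are all assumptions made up for illustration, and only three of the 14 instructions are modelled.

```c
#include <stddef.h>
#include <stdio.h>

/* Register indices; the CPU layout follows the article's structure. */
enum { REG_A, REG_B, REG_C, REG_D, REG_SP, REG_IP };

/* Hypothetical opcodes and system call numbers (not the real encoding). */
enum { OP_LOADI, OP_ADD, OP_LOOP, OP_INT };
enum { SYS_WRITE = 1, SYS_EXIT = 2 };

typedef struct {
    unsigned int  regs[6];
    unsigned int *stack;
} CPU;

/* Hypothetical fixed-size instruction: opcode plus two operands. */
typedef struct {
    unsigned int op, a, b;
} Insn;

/* System call handler. Returns 0 when the program asks to exit. */
static int handle_syscall(CPU *cpu, unsigned int number)
{
    switch (number) {
    case SYS_WRITE:                       /* simplified: print low byte of B */
        putchar((int)(cpu->regs[REG_B] & 0xFF));
        cpu->regs[REG_A] = 1;             /* A holds the return value */
        return 1;
    case SYS_EXIT:
    default:
        return 0;
    }
}

/* Core function: reset IP, then interpret until the program exits. */
static void core(CPU *cpu, const Insn *code)
{
    cpu->regs[REG_IP] = 0;
    for (;;) {
        const Insn *in = &code[cpu->regs[REG_IP]++];
        switch (in->op) {
        case OP_LOADI:                    /* reg[a] = immediate b */
            cpu->regs[in->a] = in->b;
            break;
        case OP_ADD:                      /* reg[a] += reg[b] */
            cpu->regs[in->a] += cpu->regs[in->b];
            break;
        case OP_LOOP:                     /* --C; jump to a while C != 0 */
            if (--cpu->regs[REG_C] != 0)
                cpu->regs[REG_IP] = in->a;
            break;
        case OP_INT:                      /* system call number in a */
            if (!handle_syscall(cpu, in->a))
                return;
            break;
        default:                          /* unknown opcode: stop */
            return;
        }
    }
}
```

Feeding core() a program that loads A = 0, B = 2, C = 5, then repeats "add A, B" under LOOP and finally issues sys_exit leaves 10 in A.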

Implementation
It is a common practice to implement a framework as a DLL (dynamic link library), for example, mscoree.dll - the core of the .Net framework. I do not see any reason to reinvent the wheel, therefore, this framework will be implemented as a DLL as well.

All is fine, you may say, but how should we pass the compiled pseudo assembly code to the framework? Well, I bet, most of you know how to do that. In case you don't - no worries, just keep reading.

In case of .Net framework (at least as far as I know), the loader identifies a file as a .Net executable, reads in the meta header, and initializes the mscoree.dll appropriately. We will not go through all those complications and will use a regular PE file:


Figure 2.
Customized PE file.

  • PE Header - regular PE Header, no modification needed;
  • Code Section - simply invokes the core function of the framework:

    push pseudo_code_base_address
    call [core]
  • Import Section - regular import section that only imports one function from the framework.dll - framework.core(unsigned int);
  • Data Section - this section contains the actual compiled pseudo assembly code and whatever headers you may come up with, that may instruct the core() function to correctly initialize the application.






Example Executable Source Code
The following is the source code of the example executable. It may be compiled with FASM (Flat Assembler).

include 'win32a.asm' ;we need the 'import' macro
include 'asm.asm'    ;pseudo assembly commands and constants

format PE console
entry start

section '.text' readable executable
start:
   push _base
   call [core_func]

section '.idata' data import writeable
library  framework, 'framework.dll'

import framework,\
   core_func, 'Core'

section '.data' readable writeable
_base:
   loadi A, _base
   loadi B, 0x31
   _add A, B
   loadr B, A
   loadi A, _data.string
   loadi C, _data.string_len
   _call _func
   loadi A, 1
   loadi B, _data.string
   loadi C, _data.str_len
   _int sys_write
   loadi A, 1
   loadi B, _data.msg
   loadi C, _data.msg_len
   _int sys_write
   _int sys_exit


_func:
   ; A = string address
   ; B = key
   ; C = counter
.decode:
   loadr D, A
   xorr D, B
   storr A, D
   loadi D, 4
   _add A, D
   _loop .decode
   _ret



_data:
.string db 'Hello, developer!', 10, 13
.str_len = $-.string
db 0
.string_len = ($-.string)/4
.msg db 'The program will now exit.', 10, 13
.msg_len = $-.msg

;Encrypt one string
load k dword from _base + 0x31
repeat 5
load a dword from _data.string + (% - 1) * 4
a = a xor k
store dword a at _data.string + (% - 1) * 4
end repeat



The code above produces a tiny executable which invokes framework's core() function. Pseudo assembly code simply prints two messages (the first one is decoded prior to being printed). Full sources are attached to this article (see the very first line).

The good thing is that you do not have to start the interpreter and load this executable (or specify it as a command line parameter) - you may simply run this executable, Windows loader will bind it with the framework.dll automatically. The bad thing is that you would, most probably, have to write your own compiler, because writing assembly is fun, dealing with pseudo assembly is fun as well, BUT, only when done for fun. It is not as pleasant when dealing with production code.


Possible uses
Unless you are trying to create a framework that would outdo the existing ones, you may use this approach to increase the protection of your applications by, for example, virtualizing cryptographic algorithms or any other part of your program which is not speed-critical but represents sensitive intellectual property.

Hope you find this article helpful.

See you at the next!




Thursday, May 17, 2012

Basics of Data Obfuscation

Source code for this article may be found here.

One of the aspects of software anti RE (reverse engineering) protection is the need to protect sensitive data (for example, decryption or license keys). It is quite a common practice to store such data in encrypted form and to pass it to a certain routine for decryption before use. I am not going to say that this is a bad idea, but the problem with such an approach is that vendors (in most cases) rely only on the complexity of the encryption algorithm, which is not as protective as it is thought to be and too often is placed in a single function (which, potentially, may be ripped and used with malicious intent).

I have already covered the basics of executable code obfuscation in this article, now it's time to take a look at how data may be hidden (this approach may be used with executable code as well) by, for example, putting it on stack and using several separate functions to reconstruct the original data.

The idea of hiding data in uninitialized variables (which I am going to talk about here) is not new at all, but still is rarely used, if at all.

Note for nerds: 
This is not a tutorial, nor a howto. This is a basic explanation of the concept (no, this is not my invention and yes, there are other ways). The supplied code may not be perfect. It may contain bugs and is given here as an example only.


Needle and the Haystack

While the needle is the data we want to hide, the haystack is our whole program. You may hide data anywhere - data section, code section, etc. You may even spread parts of the data throughout the program. In this particular example, the data pretends to be a part of the key computation algorithm. We will reconstruct the data on the stack (this is thread safe, as every thread has its own stack anyway).

As this is (and I will reiterate this) just an example, our program is quite short:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DATA_SIZE 16

// Implemented in assembler (see below)
unsigned int   CalcKey(unsigned int seed);
unsigned int   Mutate(unsigned int dummy);
unsigned char* GetPtr(void);
void           Decode(void);

int main(int argc, char** argv)
{
   unsigned int key;
   char*        str;
   char*        res = (char*)malloc(sizeof(char) * DATA_SIZE);

   // Calculate pseudo key
   key = CalcKey(0x12345678);

   // Mutate the key (get the actual key)
   key = Mutate(0);

   // Get the pointer to the data
   str = (char*)GetPtr();

   // Decode the data 
   Decode();

   // Copy the data to a buffer
   memcpy(res, str, sizeof(char) * DATA_SIZE);

   // Print the data (which is actually a string)
   puts(res);

   free(res);
   return 0;
}

As you may see, there is a set of functions used to construct the hidden data (functions are written in assembler):


unsigned int CalcKey(unsigned int seed);

Uses "seed" to start preparing the decryption key. The value returned by this function should be used somewhere, for any kind of "decryption" operation, just in order to lead the attacker astray. You may say that sooner or later this move would be disclosed and the attacker would get back to this point and revise it, and you will be right. However, given that a "real life" implementation should be more complicated than the following code, it would take a while until the real purpose is discovered. Even more than that, it would still scare away some "hackers".

The following code is the implementation of the CalcKey function used in this example:

calc_key:
   push  ebp
   mov   ebp, esp
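   sub   esp, 0x14            ;this is the amount of bytes we would need 
                              ;for data reconstruction
   push  edi                  ;saved below the reserved area, so that
                              ;[ebp - 4] stays free for the key and the
                              ;final pop edi restores the right value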
   and   dword [ebp - 4], 0   ;forming the "key"
   dec   dword [ebp - 4]
   mov   eax, [ebp + 8]
   xor   dword [ebp - 4], eax ;by this line, the real key is half ready
   mov   eax, [ebp - 4]


   lea   edi, [ebp - 0x14]    ;go on with making the fake key
   push  eax
   mov   eax, 0x5A309FC0
   stosd
   mov   eax, 0x617CD6E7
   stosd
   mov   eax, 0x523088E7
   stosd
   mov   eax, 0x365CFAA9
   stosd
   pop   eax


   xor   eax, dword[ebp - 8]
   xor   eax, dword[ebp - 12]
   xor   eax, dword[ebp - 16]
   xor   eax, dword[ebp - 20] ;pseudo key is ready
                              ;as you see, the return value is the pseudo key
                              ;the real one remains on stack
   pop   edi
   leave
   ret

The four constants loaded into eax (highlighted in the original listing), which may seem to be a part of the pseudo key calculation, are in fact the data. As you can see, we put them on the stack and "forget" them there. It is important to mention that you have to be careful if you decide to use the stack for this purpose, and make sure the data is not overwritten by subsequent calls to other functions. In order to make sure this does not happen, the suggestion is to put the actual data further into the stack (e.g. at [ebp - 0x100] instead of [ebp - 0x14] or even further).

I would say it again - make use of the pseudo key somewhere.



unsigned int Mutate(unsigned int dummy);

"dummy" is a dummy parameter and my personal suggestion is to do some manipulations with it. This function may look like one that produces different keys derived from the pseudo key computed by CalcKey, depending on the "dummy" parameter. Well, it does. But those keys are not used in this example. What it does indeed is mutate the half generated key, which is still present on the stack where we left it in the CalcKey function (if it is not - check your code), and finalize the key generation process.

mutate:
   push  ebp
   mov   ebp, esp
   
   mov   eax, [ebp - 4]
   rol   dword[ebp - 4], 1
   xor   dword[ebp - 4], eax  ;finalize real key computation


   xor   eax, [ebp + 8]       ;make use of "dummy" parameter


   leave
   ret            

Once this function returns, we have a ready to use key somewhere in the stack space. A small note to satisfy nerds (as others should know this by default)  - you should not call these functions one after the other in real life.


unsigned char* GetPtr(void);

This is the simplest (meaning shortest) function. All it does is return a pointer to the location of the data inside the stack area.

get_ptr:
   push  ebp
   mov   ebp, esp
   sub   esp, 0x14
   mov   eax, esp
   leave
   ret


In case of this example, the GetPtr() function returns the pointer itself, however, you may make it return any value that allows you to form a real pointer to the real data. Another recommendation is to call this function before the data gets decrypted so that it may be considered a pointer to immediate.


void Decode(void);

Finally, the end of this "complicated" procedure - decoding the actual data with the actual key.

decode:
   push  ebp
   mov   ebp, esp


   mov   eax, [ebp - 4]      ;remember? the real key should still be here
   xor   [ebp - 8], eax      ;decode the data
   xor   [ebp - 12], eax
   xor   [ebp - 16], eax
   xor   [ebp - 20], eax


   leave
   ret
   
Upon return from this function, the pointer obtained with GetPtr() would point to the decrypted data which is still on stack. Suggestion is - move it from there and overwrite that stack area with whatever you want.

Compiling and running the attached code would print the famous "Hello, World!" string to the terminal.
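If you want to check the arithmetic without touching the stack at all, the whole chain can be replayed in portable C. The sketch below only mirrors the computations done by calc_key, mutate and decode with the same seed and the same four constants; the function names real_key and reconstruct are mine and are not part of the attached code.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by one bit, like the x86 ROL instruction. */
static uint32_t rol32_1(uint32_t v)
{
    return (v << 1) | (v >> 31);
}

/* Replays what calc_key and mutate leave in the [ebp - 4] slot. */
static uint32_t real_key(uint32_t seed)
{
    uint32_t key = 0xFFFFFFFFu ^ seed;   /* and 0, dec, xor with the seed */
    return rol32_1(key) ^ key;           /* rol 1, then xor with old value */
}

/* Replays decode: XOR the four hidden dwords with the real key and
 * write the result byte by byte, lowest byte first (x86 byte order). */
static void reconstruct(uint32_t seed, char out[16])
{
    /* Same constants, in the same order as the stosd sequence. */
    static const uint32_t hidden[4] = {
        0x5A309FC0u, 0x617CD6E7u, 0x523088E7u, 0x365CFAA9u
    };
    uint32_t key = real_key(seed);
    int i, b;

    for (i = 0; i < 4; i++) {
        uint32_t v = hidden[i] ^ key;
        for (b = 0; b < 4; b++)
            out[i * 4 + b] = (char)((v >> (8 * b)) & 0xFF);
    }
}
```

With the seed 0x12345678 used in main, real_key yields 0x365CFA88 and reconstruct fills the buffer with the zero-terminated string "Hello, World!".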

Hope I managed to explain the idea and that you may find this article interesting.

See you at the next!



Friday, March 2, 2012

Dynamic Code Encryption as an Anti Dump and Anti Reverse Engineering measure

Source code for this article may be found here.


Too much has been said and written on how software vendors do not protect their products, so let me skip this. Instead, in this article, I would like to concentrate on those relatively easy steps which software vendors have to take in order to enhance their protection (using packers and protectors is good, but certainly not enough) by not letting the whole code appear in memory in readable form for a single moment.

Attack Vectors
Prior to dealing with "why attackers are able to x, y, z" let us map most frequent attack vectors in ascending order of their complexity.

Static Analysis - inspecting an executable in your favorite disassembler. It may be hard to believe, but the majority of software products out there are vulnerable to static analysis, thus showing us that most vendors do not care about the safety of their proprietary algorithms, in addition to the fact that they seem not to care about piracy either (but they tend to cry about it all the time).

Dynamic Analysis - running an executable inside your favorite debugger. This is a direct consequence of the previous paragraph. If an attacker is able to see the whole code in the disassembler - he/she definitely can run it in a debugger (even if this requires some minor patching).

Static Patching - this means changing the code located in the file of the executable. It may be changing one jump or adding a couple of dozens of bytes of attacker's own code in order to alter the way the program runs.

Dynamic Patching - similar to static patching in the idea behind the method. The only difference is, that dynamic patching is performed while the target executable is loaded into memory.

Dumping - saving the data in memory to a file on disk. This method may be very useful when examining a packed executable. Such memory dumps may be easily loaded into, for example, IDA and examined as if that was a regular executable (some additional actions may be required for better convenience, like rebasing the program or adjusting references to other modules).

In most cases, at least two of the aforementioned vectors would be present during an attack.


Packers and Cryptors
Using different packers, cryptors and protectors is quite a known practice among software vendors. The problem with this is, that few of them go beyond packing the code in file and fully unpacking it in memory and, sometimes, protecting the packer itself. By saying "go beyond" I mean any implementation of anti debugging methods of any kind. Besides, such utilities do not prevent an attacker from obtaining a memory dump good enough to deal with. One or two check the consistency of the code, which may (yes - may, as it not necessarily can) prevent patching the code, but every wall has a door and it only matters how much effort opening that door may require. Bottom line is, that these types of protection may only be useful in preventing static analysis, but only if there is no relevant unpacker or decrypter.


Protectors
This is "the next step" in the evolution of packers. These provide a bit more options and tools to estimate how secure the environment is. In addition to packing the code, they also utilize code consistency checks, anti debugging tricks, license verification, etc. Protectors are good countermeasures to the first three (or even four) attack vectors. However, even if a certain protector has some anti patching heuristics, it is only good as long as those heuristics are not reversed and either patched or fooled in any other way.

Despite all the "good" in protectors, even such powerful tools are not able to do much in order to prevent an attacker from obtaining a memory dump, which may be obtained by either using ReadProcessMemory or injecting a DLL and dumping "from inside" while suspending all other threads.


Anything Else?
Yes, there are some basic protections provided by the operating system, like session separation, for example, which prevents creation of remote threads (used with DLL injection), but those are hardly worth even mentioning here.

The picture drawn here appears to be sad and hopeless enough. However, there are several good methods to add more protection to a software product and more pain in some parts of the body to attackers.


Code Obfuscation
While this method is widely used by protectors and, sometimes, by packers and cryptors (unfortunately, in most cases, for protecting themselves only), it seems to be almost totally unknown to the rest of software vendors. In my opinion, branching the code more than is usually needed may not be considered code obfuscation; it may rather be called an attempt to obfuscate an algorithm. The situation is such that even an implementation of something similar to this would be a significant improvement in vendors' efforts to protect their products.


Hiding the Code
Software vendors repeatedly fail at understanding two facts - popular means more vulnerable (in regard of commercial solutions) and the fact that there is no magic cure and they have to put some additional effort into protecting their products.

One of the options, which I would like to cover here, is dynamic encryption of executable code. This method promises that only certain parts of the code would be present in memory in readable (possible to disassemble) form, while the rest of the code (and preferably data) is encrypted.

I am still sure - the best way to explain something is explanation by example. The small piece of C code described below is intended to show the principle of dynamic code encryption. It contains several functions in addition to main - the first is the one (the target) we are going to protect. It does nothing special, just calculates the factorial of 10 and prints it out. The main function invokes a decrypter in order to decrypt the target, calls the target (thus, displaying the factorial of 10) and, finally invokes cryptor to encrypt the target back (hide it).

The code may be compiled for both Linux (using gcc) and Windows (using mingw32). It uses obfuscation code from here.


Target Function
Our target function is quite simple (it only calculates factorial for hardcoded number):

#include <stdio.h>

void func(void)
{
   __asm__ __volatile__("enc_start:");
   {  /* Braces are used here as we do not want IDA to track parameters */
      int i, f = 10;
      for( i = 9; i > 0; i--)
         f *= i;
      printf("10! = %d\n\n", f);
   }
   __asm__ __volatile__("enc_end:");
}

Did you notice the labels at the beginning and at the end of the function body? These labels are only used for getting the start address of the region to be decrypted/encrypted and calculating its length. Due to the fact that these labels are not handled by the C compiler, but are passed to the assembler as is, they are accessible from other functions by default. The rest of the code is enclosed in braces in order to put all the actions related to variables i and f in the encrypted part of the function. This is what it looks like, before being decrypted:


Although, in attached code, the initial encryption is performed upon program start, in reality, it should be done with, probably, a third party tool. You would only have to put some unique marking at the start and end of the region you want to encrypt. For example:


__asm__(".byte  0x0D, 0xF0, 0xAD, 0xDE");
void  func()
{
...
}
__asm__(".byte  0xAD, 0xDE, 0xAD, 0xDE");


Encryption Algorithm
Selection of encryption algorithm is totally up to you. In this particular case, the algorithm is quite primitive (it does not even require a key):

b  - byte
i  - position (0 .. length-1)
for i = 0; i < length - 1; i++
   b(i+1) = b(i+1) xor (b(i) rol 1)
b(0) = b(0) xor (b(length-1) rol 1)
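For reference, this is how the algorithm above may look in C, together with its inverse. This is a sketch, not the attached code: the function names are made up, and I assume that the final step wraps around from the last byte of the region (index length - 1) back to byte 0.

```c
#include <stddef.h>

/* Rotate a byte left by one bit. */
static unsigned char rol8(unsigned char b)
{
    return (unsigned char)((b << 1) | (b >> 7));
}

/* Forward pass: every byte is XORed with its already processed
 * predecessor rotated left by one; byte 0 is closed with the last byte. */
static void chain_encode(unsigned char *b, size_t len)
{
    size_t i;

    if (len < 2)
        return;                       /* nothing to chain */
    for (i = 0; i + 1 < len; i++)
        b[i + 1] ^= rol8(b[i]);
    b[0] ^= rol8(b[len - 1]);
}

/* Inverse pass: undo the steps in exactly the opposite order. */
static void chain_decode(unsigned char *b, size_t len)
{
    size_t i;

    if (len < 2)
        return;
    b[0] ^= rol8(b[len - 1]);         /* restore byte 0 first */
    for (i = len - 1; i > 0; i--)     /* then walk back to the start */
        b[i] ^= rol8(b[i - 1]);
}
```

The decoder has to walk backwards because each byte was mixed with the encoded value of its predecessor, which is only still available if the predecessor has not been decoded yet.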

Execution Flow
So, let us assume that the program started with the function already encrypted. As this is just an example, we can get to the business right away:

int main()
{
   unsigned int  addr, len;
   __asm__ __volatile__("movl  $enc_start, %0\n\t"\
                        "movl  $enc_end, %1\n\t"\
                        : "=r"(addr), "=r"(len));
   len -= addr;
   decode(addr, len);
   func();
   encode(addr, len);
   return 0;
}

The code above is self explanatory enough. There are, however, a couple of things that need to be mentioned. The decode and encode functions should take care of modifying the access rights of the memory region they are going to operate on. The following code may be used:

#ifdef WIN32
#include <windows.h>
#define SETRWX(addr, len)   {\
                               DWORD attr;\
                               VirtualProtect((LPVOID)((addr) &~ 0xFFF),\
                                  (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                  PAGE_EXECUTE_READWRITE,\
                                  &attr);\
                            }
#define SETROX(addr, len)   {\
                               DWORD attr;\
                               VirtualProtect((LPVOID)((addr) &~ 0xFFF),\
                                  (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                  PAGE_EXECUTE_READ,\
                                  &attr);\
                            }
#else
#include <sys/mman.h>
#define SETRWX(addr, len)   mprotect((void*)((addr) &~ 0xFFF),\
                                     (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                     PROT_READ | PROT_EXEC | PROT_WRITE)
#define SETROX(addr, len)   mprotect((void*)((addr) &~ 0xFFF),\
                                     (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                     PROT_READ | PROT_EXEC)
#endif

This is the only platform dependent code in this sample.
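The address arithmetic hidden inside those macros is easier to follow when written out as functions. The helpers below are an illustration only (the names are mine) and hard-code the same 4 KB page size the macros assume; on POSIX systems the actual page size can be queried with sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK ((uintptr_t)0xFFF)   /* assumes 4 KB pages */

/* Round an address down to the start of the page that contains it. */
static uintptr_t page_base(uintptr_t addr)
{
    return addr & ~PAGE_OFFSET_MASK;
}

/* Length to pass along with page_base(addr): the region itself plus
 * the bytes between the page start and the first byte of the region. */
static size_t protect_len(uintptr_t addr, size_t len)
{
    return len + (size_t)(addr - page_base(addr));
}
```

For a region of 0x40 bytes at 0x401234 this gives a base of 0x401000 and a length of 0x274, so the protection change covers the region even though it does not start on a page boundary.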

Bottom Line
The example given above is really a simple one. Things would be at least a bit more complicated in real life. While there is only one encrypted function, imagine, that there are several encrypted functions. Some of them are encrypted without keys (like the one above) others require keys of different complexity. Several keys may be hardcoded (for those parts that were encrypted in order to draw attacker's attention away from the "real" thing), others should be computed on the fly.

Example:
Function A is encrypted without a key. When decrypted, it performs several operations and decrypts function B, which, in turn encrypts function A back and calculates a key for function C based on the binary content of function A (or A and B to prevent breakpoints) or even based on some other code in unrelated place.
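A key "based on the binary content of function A" may be as simple as a rolling checksum over its bytes. The function below is a hypothetical illustration (neither the name nor the mixing step comes from the attached code): it rotates the accumulator and XORs in every byte, so a single changed byte, such as a 0xCC software breakpoint, produces a different key and the next function decrypts into garbage.

```c
#include <stddef.h>
#include <stdint.h>

/* Derive a 32-bit key from a block of code bytes. */
static uint32_t key_from_code(const unsigned char *code, size_t len)
{
    uint32_t key = 0;
    size_t   i;

    for (i = 0; i < len; i++) {
        key = (key << 5) | (key >> 27);   /* rotate left by 5 bits */
        key ^= code[i];                   /* mix in the next byte */
    }
    return key;
}
```

A real scheme would use something less trivial, but the principle stays the same: the key is never stored, it only exists while the code it depends on is intact.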

Of course, there is no such thing as unbreakable protection. But the time it takes to break a certain protection makes the difference. A company that produces a software product which is cracked the next day may hardly benefit from all the hard work. On the other hand, it is totally possible to create protection schemes that would require months to be cracked.

I will try and cover additional possibilities and aspects of software protection in my future posts in a hope to at least try to change the situation.


Hope this post was helpful.
See you at the next!















Friday, December 16, 2011

Executable Code Injection the Interesting Way

So. Executable code injection. In general, this term is associated with malicious intent. It is true in many cases, but in at least as many, it is not. Having been a malware researcher for most of my career, I can assure you that this technique appears to be very useful when researching malicious software, as it allows you (in most cases) to defeat its protection and gather much of the needed information. Although it is highly recommended not to use such an approach, sometimes it is simply unavoidable.

There are several ways to perform code injection. Let's take a look at them.

DLL Injection
The simplest way to inject a DLL into another process is to create a remote thread in the context of that process, passing the address of the LoadLibrary API as the ThreadProc. However, it appears to be unreliable in modern versions of Windows due to address randomization (which is currently not the case, but who knows, maybe one day it becomes real randomization).

Another way, a bit more complicated, implies a shell code to be injected into the address space of another process and launched as a remote thread. This method offers more flexibility and is described here.

Manual DLL Mapping
Unfortunately, it has become fashionable to give new fancy names to good old techniques. Manual DLL Mapping is nothing more than a complicated code injection. Why complicated, you may ask - because it involves the implementation of a custom PE loader, which should be able to resolve relocations. Adhering to the Occam's Razor principle, I take the responsibility to claim that it is much easier and makes more sense to simply allocate memory in another process using the VirtualAllocEx API and inject position independent shell code.

Simple Code Injection
As the title of this section states, this is the simplest way. Allocate a couple of memory blocks in the address space of the remote process using VirtualAllocEx (one for code and one for data), copy your shell code and its data into those blocks and launch it as a remote thread.

All the methods listed above are covered well on the Internet. You may just google for "code injection" and you will get thousands of well written tutorials and articles. My intention is to describe a more complex, but also a more interesting way of code injection (in a hope that you have nothing else to do but try to implement this).

Before we start:
Another note for nerds. 
  • The code in this article does not contain any security checks unless it is needed as an example.
  • This is not a malware writing tutorial, so I do not care whether the AV alerts when you try to use this method.
  • No, manual DLL mapping is not better ;-).
  • Neither do I care about how stable this solution is. If you decide to implement this, you will be doing it at your own risk.
Now, let's have some fun.




Disk vs Memory Layout
Before we proceed with the explanation, let's take a look at the PE file layout, whether on disk or in memory, as our solution relies on that.

This layout is logically identical for both PE files on disk and PE files in memory. The only differences are that some parts may not be present in memory and, most important for us, that on disk items are aligned by the "File Alignment" value, while in memory they are aligned by the "Section Alignment" value (usually equal to the page size). Both may be found in the Optional Header. For the full PE COFF format reference check here.

Right now, we are particularly interested in sections that contain executable code ((SectionHeader.characteristics & 0x20000020) != 0). Usually, the actual code does not fill the whole section, leaving some parts simply padded with zeros. For example, if our code section only contains 'ExitProcess(0)', which may be compiled into 8 bytes, it will still occupy FileAlignment bytes on disk (usually 0x200 bytes). It will take even more space in memory, as the next section may not be mapped closer than this_section_virtual_address + SectionAlignment (in this particular case, usually 0x1000 bytes), which means that if we have 0x1F8 free bytes when the file is on disk, we'll have 0xFF8 free bytes when the file is loaded in memory.
The "formula" to calculate the available space in a code section is next_section_virtual_address - (this_section_virtual_address + this_section_virtual_size), as the virtual size is (usually) the amount of actual data in the section. Remember this, as that is the space that we are going to use as our injection target.
It may happen that the target executable does not have enough spare space for our shell code, but do not let this bother you too much. A process contains more than one module (the main executable and all the DLLs). This means that you can look for spare space in the code sections of all modules. Why only code sections? Just in order not to mess too much with memory protection.
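The spare space calculation may be sketched in C as follows. SectionInfo is a hypothetical cut-down stand-in for the real section header (only the three fields the formula needs), and the function names are made up; the two flag values are the PE section characteristics that make up the 0x20000020 mask mentioned above.

```c
#include <stdint.h>

/* Hypothetical subset of a PE section header. */
typedef struct {
    uint32_t virtual_address;   /* RVA of the section in memory */
    uint32_t virtual_size;      /* amount of actual data in the section */
    uint32_t characteristics;   /* section flags */
} SectionInfo;

#define SCN_CNT_CODE    0x00000020u   /* section contains code */
#define SCN_MEM_EXECUTE 0x20000000u   /* section is executable */

/* Same test as (SectionHeader.characteristics & 0x20000020) != 0. */
static int is_code_section(const SectionInfo *s)
{
    return (s->characteristics & (SCN_CNT_CODE | SCN_MEM_EXECUTE)) != 0;
}

/* Free bytes between the end of the data of 'cur' and the start of the
 * section mapped right after it; 0 for non-code sections or no gap. */
static uint32_t code_slack(const SectionInfo *cur, const SectionInfo *next)
{
    uint32_t end = cur->virtual_address + cur->virtual_size;

    if (!is_code_section(cur) || next->virtual_address <= end)
        return 0;
    return next->virtual_address - end;
}
```

For the 'ExitProcess(0)' example (8 bytes of code at RVA 0x1000, next section at RVA 0x2000) this returns 0xFF8.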

Shellcode
The first and the most important rule for shellcode - it MUST be position independent. In our case, this rule is especially unavoidable (if you may say so) as it is going to be spread all over the memory space (depends on the size of your shell code, of course). 

The second, but not less important rule - carefully plan your code according to your needs. The less space it takes, the easier the injection process would be.

Let's keep our shell code simple. All it would do is interception of a single API (does not matter which one, select whichever you want from the target executable's import section), and show a message box each time that API is called (you should probably select ExitProcess for interception if you do not want the message box popping up all the time).

Divide your shellcode into independent functional blocks. By independent, I mean that it should not have any direct or relative calls or jumps. Each block should have one data field, which would contain the address of the table containing addresses of all our functions (and data if needed). Such mechanism would allow us to spread the code all over the available space in different modules without the need to mess with relocations at all.

The picture on the left and the diagram below will help you to better understand the concept. 
Init - our initialization function. Once the code is injected, you would want to call this function as a remote thread.
Patch - this block is responsible for actually patching the import table with the address of our Fake.

The code in each of the above blocks will have to access Data in order to retrieve addresses of functions from other blocks.

Your initialization procedure would have to locate KERNEL32.DLL in memory in order to obtain the addresses of the LoadLibrary (yes, it would be better to use LoadLibrary rather than GetModuleHandle), GetProcAddress and VirtualProtect API functions, which are crucial even for such a simple task as patching one API call. Those addresses would be stored in Data.

The Injector
While the shellcode is pretty trivial (at least in this particular case), the injector is not. It will not allocate memory in the address space of another process (if possible, of course). Instead, it will parse the PEB (Process Environment Block) of the victim in order to get the list of loaded modules. Once that is done, it will parse the section headers of every module in order to create a list of available memory locations (remember, we prefer code sections only) and fill the Data block with appropriate addresses. Let's take a look at each step.

First of all, it may be a good idea to suspend the process by calling the SuspendThread function on each of its threads. You may want to read this post about thread enumeration. One more thing to remember is to open the victim process with the following flags: PROCESS_VM_READ | PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION | PROCESS_SUSPEND_RESUME in order to be able to perform all the following operations. The SuspendThread function itself is quite simple:

DWORD WINAPI SuspendThread(__in HANDLE hThread);

Don't forget to resume all threads with ResumeThread once the injection is done.

The next step would be calling the NtQueryInformationProcess function from ntdll.dll. The only problem with it is that it has no associated import library, so you will have to locate it with GetProcAddress(GetModuleHandle("ntdll.dll"), "NtQueryInformationProcess"), unless you have a way to explicitly specify it in the import table of your injector. Also, try LoadLibrary if GetModuleHandle does not work for you.

NTSTATUS WINAPI NtQueryInformationProcess(
   __in      HANDLE ProcessHandle,
   __in      PROCESSINFOCLASS ProcessInformationClass, /* Use 0 in order to 
                                               get the PEB address */
   __out     PVOID ProcessInformation,  /* Pointer to the PROCESS_BASIC_INFORMATION
                                                       structure */
   __in      ULONG ProcessInformationLength, /* Size of the PROCESS_BASIC_INFORMATION
                                                     structure in bytes */
   __out_opt PULONG ReturnLength
);

typedef struct _PROCESS_BASIC_INFORMATION
{
   PVOID     ExitStatus;
   PPEB      PebBaseAddress;
   PVOID     AffinityMask;
   PVOID     BasePriority;
   ULONG_PTR UniqueProcessId;
   PVOID     InheritedFromUniqueProcessId;
} PROCESS_BASIC_INFORMATION;

NtQueryInformationProcess will provide you with the address of the PEB of the victim process. This post explains how to deal with the PEB content. Of course, you will not be able to access that content directly (as it is in the address space of another process), so you will have to use the WriteProcessMemory and ReadProcessMemory functions for that.

BOOL WINAPI WriteProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPVOID   lpBaseAddress,  /* Address in another process */
   __in   LPCVOID  lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesWritten
);

BOOL WINAPI ReadProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPCVOID  lpBaseAddress, /* Address in another process */
   __out  LPVOID   lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesRead
);

Because you are going to deal with read-only memory locations, you should call VirtualProtectEx in order to make those locations writable (PAGE_EXECUTE_READWRITE). Don't forget to restore the memory access permissions to PAGE_EXECUTE_READ (or to whatever value was returned in lpflOldProtect) when you are done.

BOOL WINAPI VirtualProtectEx(
   __in  HANDLE hProcess,
   __in  LPVOID lpAddress, /* Address in another process */
   __in  SIZE_T dwSize,  /* Size of the range in bytes */
   __in  DWORD  flNewProtect, /* New protection */
   __out PDWORD lpflOldProtect
);

You may also want to change the VirtualSize of those sections of the victim process you used for injection in order to cover the injected code. Just adjust it in the headers in memory.

That's all folks. Let me leave the hardest part (writing the code) up to you this time. 

Hope this post was interesting and see you in the next one.