Thursday, December 22, 2011

Simple Virtual Machine

Sample code for this article may be found here.

In computing, Virtual Machine (VM) is a software implementation of either existing or a fictional hardware platform.  VM's are generally divided into two classes - system VM (VM which is capable of running an operating system) and process VM (the one that only can run one executable, roughly saying). Anyway, if you are just interested in the definition of the term, you better go here

There are tones of articles dedicated to this matter on the Internet, hundreds of tutorials and explanations. I see no reason to just add another "trivial" article or tutorial to the row. Instead, I think it may be more interesting to see it in action, to have an example of real application. One may say that we are surrounded by those examples - Java, .NET, etc. It is correct, however, I would like to touch a slightly different application of this technology - protect your software/data from being hacked.

Data Protection
Millions of dollars are being spent by software (or content) vendors in an attempt to protect their products from being stolen or used in any other illegal way. There are numerous protection tools and utilities, starting with simple packers/scramblers and ending with complex packages that implement multilevel encryption and virtual machines as well. However, you may disagree, but you wont convince me, an out-of-the-box solution is good until it gains popularity. There is enough evidence for this statement. In my opinion, no one can protect your software better than you. It only depends on how much protected you want it to be.

Although, there are numerous protection methods and techniques, we are going to concentrate on a virtual machine for data coding/decoding. Nothing special, just a trivial XOR method, but, in my opinion, enough to demonstrate the fundamentals.

Design Your VM
While in real life, hardware design precedes its software counterpart, we may let ourselves to do it in reverse order (it is our own VM, after all). Therefore, we will begin with the pseudo executable file format which will be supported by our VM. 

Pseudo Executable File Format
Well, it is a good idea to put a header in the beginning of the file. In order to do so, we have to think what our file is going to contain. The file may be a raw code (remember DOS com files?), but this would not be interesting enough. So, let our file be divided into three sections:
  • code section - this section would contain code written in our pseudo assembly language (we'll cover it a bit later);
  • data section - this section would contain all the data needed by our pseudo executable (PE :-) );
  • export section - this section would contain references to all the elements that we want to make visible to the core program.
Let us define the header as a C structure:

typedef struct _VM_HEADER
   unsigned int  version; /* Version of our VM. Will be 0x101 for now */
   unsigned int  codeOffset; /* File offset of the code section */
   unsigned int  codeSize; /* Size of the code section in bytes */
   unsigned int  dataOffset; /* File offset of the data section */
   unsigned int  dataSize; /* Size of the data section in bytes */
   unsigned int  exportOffset; /* File offset of the export section */
   unsigned int  exportSize; /* Size of the export section in bytes */
   unsigned int  requestedStack; /* Required size of stack in 4 bytes blocks */
   unsigned int  fileSize; /* Size of the whole file in bytes */

Well, one more thing. Actually the most important one. We need a compiler for our pseudo assembly that would be able to output files of this format. Fortunately, we do not have to write one (although, this may be an interesting task). Tomasz Grysztar has done a wonderful work with his Flat Assembler. Despite the fact, that this compiler is intended to compile Intel assembly code, thanks to the wonderful macro instruction support, we can adopt it to our needs. The skeleton source for our file would look like this:

include 'defs.asm' ;Definitions of our pseudo assembly instructions

org 0
; Header =======================
   h_version   dd 0x101
   h_code      dd _code
   h_code_size dd _code_size
   h_data      dd _data
   h_data_size dd _data_size
   h_exp       dd _export
   h_exp_size  dd _export_size
   h_stack     dd 0x40
   h_size      dd size

; Code =========================
      ;some pseudo code here
_code_size = $ - _code

; Data =========================

   ;some data here

_data_size = $ - _data

; Export =======================
   ;export table structures here

_export_size = $ - _export
size = $ - h_version

as simple as that.

Export section deserves special attention. I tried to make it as easy to use as possible. It is divided into two parts:

  1. Array of file offsets of export entries terminated by 0;
  2. Export entries:
    1. File offset of the exported function/variable (4 bytes);
    2. Public name of the exported object (NULL terminated ASCII string);
In the above example, the export section would look like this:

; Array of file offsets
      dd  _f1  ; Offset of '_f1' export entry
      dd  0    ; Terminating 0
; List of export entries
  _f1 dd  _function                  ; File offset
      db  'exported_function_name',0 ; Public name

Save the file as 'something.asm' or whatever name you prefer. Compile it with Fasm.

Pseudo Assembly Language
Now, when we are done with the file format, we have to define our pseudo assembly language. This includes both definition of commands and instruction encoding. As this VM is designed to only code/decode short text message, there is no need to develop full scale set of commands. All we need is MOV, XOR, ADD, LOOP and RET

Before you start writing macros that would represent these commands, we have to think about instruction encoding. This is not going to be difficult - we are not Intel. For simplicity, all our instructions will be two bytes long followed by one or more immediate arguments if there are any. This allows us to encode all the needed information, such as opcode, type of arguments, size of arguments and operation direction:
typedef struct _INSTRUCTION
   unsigned short opCode:5;    /* Opcode value */
   unsigned short opType1:2;   /* Type of the first operand if present */
   unsigned short opType2:2;   /* Type of the second operand if present */
   unsigned short opSize:2;    /* Size of the operand(s) */
   unsigned short reg1:2;      /* Index of the register used as first operand */
   unsigned short reg2:2;      /* Index of the register used as second operand */
   unsigned short direction:1; /* Direction of the operation */


Define the following constants:

/* Operand types */

#define OP_REG 0 /* Register operand */
#define OP_IMM 1 /* Immediate operand */
#define OP_MEM 2 /* Memory reference */
#define OP_NONE 3 /* No operand (optional) */

/* Operand sizes */
#define _BYTE 0
#define _WORD 1
#define _DWORD 2

/* Operation direction */
#define DIR_LEFT  0
#define DIR_RIGHT 1

/* Instructions (OpCodes) */
#define MOV     1
#define MOVI 7
#define ADD     2
#define SUB 3
#define XOR     4
#define LOOP 5
#define RET 6

It seems to me that there is no reason to put all the macros defining our pseudo assembly opcodes here, as it would be a waste of space. I will just put one here as an example. This will be the definition of MOV instruction:

Constants to be used with our pseudo assembly language

Macro defining the MOV instruction

As you can see in the code above, I've been lazy again and decided, that it would be easier to implicitly specify the size of the arguments, rather then writing some extra code to identify their size automatically. In addition, the name of the instruction tells what that specific instruction is intended to do. For example, mov_rm - moves value from memory to register and letters 'r' and 'm' tell what types of arguments are in use (register, memory). In this case, moving a WORD from memory to a register would look like this:

   mov_rm REG_A, address, _WORD

and the whole code section (currently contains only one function)  is represented by the image below:

loads address of the message as immediate value into B register; loads length of the message from address described by message_len into C register; iterates message_len times and applies XOR to every byte of the message. "mov_rmi" performs the same operation as "mov_rm" but the address is in the register specified as second parameter.

This is what the output looks like in IDA Pro:



Data and Export sections

Virtual Machine
Alright, now, when we have some sort of a "compiler", we may start working on the VM itself. First of all, let us define a structure, that would represent our virtual CPU:

typedef struct _VCPU
   unsigned int  registers[4]; /* Four registers */
   unsigned int  *stackBase;   /* Pointer to the allocated stack */
   unsigned int  *stackPtr;    /* Pointer to the current position in stack */
   unsigned int  ip;           /* Instruction pointer */
   unsigned char *base;        /* Pointer to the buffer where our pseudo 
                                  executable is loaded to */

registers - general purpose registers. There is no need for any additional register in this VM's CPU;
stackBase - pointer to the beginning of the allocated region which we use as stack for our VM;
stackPtr - this is our stack pointer;
ip - instruction pointer. Points to the next instruction to be executed. It cannot point outside the buffer containing our pseudo executable;
base - pointer to the buffer which contains our executable. You may say that this is the memory of our VM.

In addition, you should implement at least some functions for the following:
  1. allocate/free virtual CPU
  2. load pseudo executable into VM's memory and setup stack
  3. a function to retrieve either a file offset or normal pointer to an object exported by the pseudo executable
  4. a function to set instruction pointer (although, this may be done by directly accessing the ip field of the virtual CPU
  5. a function that would run our pseudo code.
In my case, the final source looks like this:

I decided not to cite the VM's code here as you should be able to write it yourself if the subject is interesting enough for you. Although, the code in this article does not contain any checks for correct return values, you should take care of them.

Although, this article describes a trivial virtual machine which is only able to encode/decode a fixed length buffer, the concept itself may serve you well in software/data protection as hacking into VM is several times harder then cracking native code.

One more thing to add. Our design allows us to call procedures provided by the pseudo executable, but there are several ways to allow the pseudo executable to "talk to us". The simplest (as it seems to me) is to implement interrupts.

I hope, I've covered it. Would appreciate comments and/or suggestions.

P.S. The encoded result would be "V{rrq2>Iqlrz?".

See you at the next post!

Monday, December 19, 2011

Listing Loaded Shared Objects in Linux

I have recently come across several posts on the Internet where guys keep asking for Linux analogs of Windows API. One of the most frequent one is something like "EnumProcessModules for Linux". As usual, most of the replies are looking like "why do you need that?" or "Linux is not Windows". Although, the last one is totally true, it is completely useless. As to "why do you need that?" - why do you care? Poor guy's asking a question here so let's assume he knows what he's doing.

I remember looking for something like this myself while working on some virtualization project for one of my previous employers. One thing I've learnt - once the question is out of ordinary (and people do not usually ask for Windows API replacements in Linux), there is a really good chance of getting tones of useless replies and blamed for being unclear. More then that, as long as it comes to Linux, most people do not really understand the difference between doing something in the shell and doing something in your program (as, unfortunately, many call shell scripts programs as well).

Well, enough crying here. Let's get to business. As usual, a note for nerds (non nerds are welcome to comment, leave suggestions, etc.)

  • the code in this article may not contain all necessary checks for invalid values;
  • yes, there are other ways of doing this;
  • you are going to mess with libc here, so be careful;
What are Modules (in this case)
In Linux the word "module" has a different meaning from what you've been used to in Windows. While in Windows this word means components of a process (main executable and all loaded DLLs), in Linux it refers to a part of the kernel (usually a driver). If this what you mean, then you probably want to enumerate loaded kernel modules and this is beyond the scope of this article. What we are going to do here, is to write to the terminal paths of all loaded shared objects (Linux analog of Windows DLL) and we are going to do it in a less common way just to see how things are organized internally. Just like we have LDR_MODULE structure in Windows, we have link_map structure in Linux. In both cases these structures describe loaded libraries (well, in Windows there's also a LDR_MODULE for the main executable).

link_map Structure
We do not need to know too much about this structure (for those interested - see "include/link.h" in your glibc sources). We may even define our own structure for that (a minimal one):

struct lmap
   void*    base_address;   /* Base address of the shared object */
   char*    path;           /* Absolute file name (path) of the shared object */
   void*    not_needed1;    /* Pointer to the dynamic section of the shared object */
   struct lmap *next, *prev;/* chain of loaded objects */

There is some more information in the original structure, but we do not need it for now.

Getting There
So we know what the link_map structure looks like and it is good. But how can we get there? let me assume that you are aware of dynamic linking. In Linux we have dl* functions:

dlopen - loads a shared object (LoadLibrary);
dlclose - unloads a shared object (FreeLibrary);
dlsym - gets the address of a symbol from the shared object (GetProcAddress).

The dlopen function returns a pseudo handle to the loaded shared object. These functions are declared in "dlfcn.h". You also have to explicitly link the dl library by passing -ldl to gcc.

While in Windows HANDLE is equal to the base address of the module, in Linux pseudo handle is in fact a pointer to the corresponding link_map structure. This means that getting to the head of the list of loaded modules is quite easy:

struct lmap* get_list_head(void* handle)
   struct lmap* retval = (struct lmap*)handle;
   while(NULL != retval->prev->path)
      retval = retval->prev;
   return retval;

Things are a bit more complicated if you do not intend to load any shared object. You will still have to use the dl library, though.

First of all, you will have to call dlopen, despite that fact that you are not going to load anything. Call it with NULL passed as first argument and RTLD_NOW as the second. The return value in this case would be the pseudo handle for the main executable (similar to GetModuleHandle(NULL) in Windows), but it would point to a different structure (to be honest, I've been too lazy to dig for it in libc sources) then link_map. This structure contains different pointers and we are particularly interested in the fourth one. This pointer points to a structure (which I was too lazy to dig for as well) with some other pointers/values and we are particularly interested (again) in the fourth one. This pointer, in turn, finally gets us to the first link_map structure. In my case, it is a structure which refers to Let's take a look at the procedure in C

struct something
   void*  pointers[3];
   struct something* ptr;

struct lmap* pl;
void* ph = dlopen(NULL, RTLD_NOW);
struct something* p = (struct something*)ph;
p = p->ptr;
pl = (struct lmap*)p->ptr;

List Loaded Objects
Now we are ready to list all loaded objects. Assume p is a pointer to the first link_map (in our case lmap) structure:

while(NULL != p)
   printf("%s\n", p->path);
   p = p->next;

In my case the output is (about three times less than in a Windows process ;-) ):

C'est tous. We are done. The mechanism described above may be used in order to either enumerate loaded shared objects or to get their handles. I personally used it for amusement.

Just remember, that in Linux, unlike Windows, handle to an object is not its base address, but the address of (pointer to) the corresponding link_map structure.

Hope this post was at least interesting (if not helpful). See you at the next!

Friday, December 16, 2011

Executable Code Injection the Interesting Way

So. Executable code injection. In general, this term is associated with malicious intent. It is true in many cases, but in, at least, as many, it is not. Being malware researcher for the most of my career, I can assure you, that this technique appears to be very useful when researching malicious software, as it allows (in most cases) to defeat its protection and gather much of the needed information. Although, it is highly recommended not to use such approach, sometimes it is simply unavoidable.

There are several ways to perform code injection. Let's take a look at them.

DLL Injection
The most simple way to inject a DLL into another process is to create a remote thread in the context of that process by passing the address of the LoadLibrary API as a ThreadProc. However, it appears to be unreliable in modern versions of Windows due to the address randomization (which is currently not true, but who knows, may be once it becomes real randomization).

Another way, a bit more complicated, implies a shell code to be injected into the address space of another process and launched as a remote thread. This method offers more flexibility and is described here.

Manual DLL Mapping
Unfortunately, it has become fashionable to give new fancy names to the old good techniques. Manual DLL Mapping is nothing more than a complicated code injection. Why complicated, you may ask - because it involves implementation of custom PE loader, which should be able to resolve relocations. Adhering the Occam's Razor principle, I take the responsibility to claim, that it is much easier and makes more sense to simply allocate memory in another process using VirtualAllocEx API and inject the position independent shell code. 

Simple Code Injection
As the title of this section states, this is the simplest way. Allocate a couple of memory blocks in the address space of the remote process using VirtualAllocEx (one for code and one for data), copy your shell code and its data into those blocks and launch it as a remote thread.

All the methods listed above are covered well on the Internet. You may just google for "code injection" and you will get thousands of well written tutorials and articles. My intention is to describe a more complex, but also a more interesting way of code injection (in a hope that you have nothing else to do but try to implement this).

Before we start:
Another note for nerds. 
  • The code in this article does not contain any security checks unless it is needed as an example.
  • This is not malware writing tutorial, so I do not care whether the AV alerts when you try to use this method.
  • No, manual DLL mapping is not better ;-).
  • Neither do I care about how stable this solution is. If you decide to implement this, you will be doing it at your own risk.
Now, let's have some fun.

Disk vs Memory Layout
Before we proceed with the explanation, let's take a look at the PE file layout, whether on disk or in memory, as our solution relies on that.

This layout is logically identical for both PE files on disk and PE files in memory. The only differences are that some parts may not be present in memory and, the most important for us, on disk items are aligned by "File Alignment" while in memory they are aligned by "Page Alignment" values, which, in turn may be found in the Optional Header. For full PE COFF format reference check here.

Right now, we are particularly interested in sections that contain executable code ((SectionHeader.characteristics & 0x20000020) != 0). Usually, the actual code does not fill the whole section, leaving some parts simply padded by zeros. For example, if our code section only contains 'ExitProcess(0)', which may be compiled into 8 bytes, it will still occupy FileAlignment bytes on disk (usually 0x200 bytes). It will take even more space in memory, as the next section may not be mapped closer than this_section_virtual_address + PageAlignement  (in this particular case), which means that if we have 0x1F8 free bytes when the file is on disk, we'll have 0xFF8 free bytes when the file is loaded in memory.
The "formula" to calculate available space in code section is next_section_virtual_address - (this_section_virtual_address + this_section_virtual_size) as virtual size is (usually) the amount of actual data in section. Remember this, as that is the space that we are going to use as our injection target.
It may happen, that the target executable does not have enough spare space for our shell code, but let this not bother you too much. A process contains more than one module (the main executable and all the DLLs). This means that you can look for spare space in the code sections of all modules. Why only code sections? Just in order not to mess too much with memory protection.

The first and the most important rule for shellcode - it MUST be position independent. In our case, this rule is especially unavoidable (if you may say so) as it is going to be spread all over the memory space (depends on the size of your shell code, of course). 

The second, but not less important rule - carefully plan your code according to your needs. The less space it takes, the easier the injection process would be.

Let's keep our shell code simple. All it would do is interception of a single API (does not matter which one, select whichever you want from the target executable's import section), and show a message box each time that API is called (you should probably select ExitProcess for interception if you do not want the message box popping up all the time).

Divide your shellcode into independent functional blocks. By independent, I mean that it should not have any direct or relative calls or jumps. Each block should have one data field, which would contain the address of the table containing addresses of all our functions (and data if needed). Such mechanism would allow us to spread the code all over the available space in different modules without the need to mess with relocations at all.

The picture on the left and the diagram below will help you to better understand the concept. 
Init - our initialization function. Once the code is injected, you would want to call this function as a remote thread.
Patch - this block is responsible for actually patching the import table with the address of our Fake.

The code in each of the above blocks will have to access Data in order to retrieve addresses of functions from other blocks.

Your initialization procedure would have to locate the KERNEL32.DLL in memory in order to obtain the addresses of LoadLibrary (yes, it would be better to use LoadLibrary rather then GetModuleHandle), GetProcAddress and VirtualProtect API functions which are crucial even for such a simple task as patching one API call. Those addresses would be stored in Data.

The Injector
While the shellcode is pretty trivial (at least in this particular case), the injector is not. It will not allocate memory in the address space of another process (if possible, of course). Instead, it will parse the the PEB (Process Environment Block) of the victim in order to get the list of loaded modules. Once that is done, it will parse section headers of every module in order to create list of available memory locations (remember, we prefer code sections only) and fill the Data block with appropriate addresses. Let's take a look at each step.

First of all, it may be a good idea to suspend the process by calling SuspendThread function on each of its threads. You may want to read this post about threads enumeration. One more thing to remember is to open the victim process with the following flags: PROCESS_VM_READ | PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION | PROCESS_SUSPEND_RESUME in order to be able to perform all the following operations. The function itself is quite simple:

DWORD WINAPI SuspendThread(__in HANDLE hThread);

Don't forget to resume all threads with ResumeThread once the injection is done.

The next step would be calling the NtQueryInformationProcess function from the ntdll.dll. The only problem with it is that it has no associated import library and you will have to locate it with GetProcAddress(GetModuleHandle("ntdll.dll"), "NtQueryInformationProcess"), unless you have a way to explicitly specify it in the import table of your injector. Also, try LoadLibrary if the GetModuleHandle does not work for you.

NTSTATUS WINAPI NtQueryInformationProcess(
   __in      HANDLE ProcessHandle,
   __in      PROCESSINFOCLASS ProcessInformationClass, /* Use 0 in order to 
                                               get the PEB address */
   __out     PVOID ProcessInformation,  /* Pointer to the PROCESS_BASIC_INFORMATION
                                                       structure */
   __in      ULONG ProcessInformationLength, /* Size of the PROCESS_BASIC_INFORMATION
                                                     structure in bytes */
   __out_opt PULONG ReturnLength

   PVOID     ExitStatus;
   PPEB      PebBaseAddress;
   PVOID     AffinityMask;
   PVOID     BasePriority;
   ULONG_PTR UniqueProcessId;
   PVOID     InheritedFromUniqueProcessId;

The NtQueryInformationProces will provide you with the address of the PEB of the victim process. This post will explain you how to deal with PEB content. Of course, you will not be able to access that content directly (as it is in the address space of another process), so you will have to use WriteProcessMemory and ReadProcessMemory functions for that.

BOOL WINAPI WriteProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPVOID   lpBaseAddress,  /* Address in another process */
   __in   LPCVOID  lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesWritten

BOOL WINAPI ReadProcessMemory(
   __in   HANDLE   hProcess,
   __in   LPCVOID  lpBaseAddress, /* Address in another process */
   __out  LPVOID   lpBuffer,  /* Local buffer */
   __in   SIZE_T   nSize,  /* Size of the buffer in bytes */
   __out  SIZE_T*  lpNumberOfBytesRead

Due to the fact that you are going to deal with read only memory locations, you should call VirtualProtectEx in order to make those locations writable (PAGE_EXECUTE_READWRITE).  Don't forget to restore memory access permissions to PAGE_EXECUTE_READ when you are done. 

BOOL WINAPI VirtualProtectEx(
   __in  HANDLE hProcess,
   __in  LPVOID lpAddress, /* Address in another process */
   __in  SIZE_T dwSize,  /* Size of the range in bytes */
   __in  DWORD  flNewProtect, /* New protection */
   __out PDWORD lpflOldProtect

You may also want to change the VirtualSize of those sections of the victim process you used for injection in order to cover the injected code. Just adjust it in the headers in memory.

That's all folks. Let me leave the hardest part (writing the code) up to you this time. 

Hope this post was interesting and see you at the next.