Thursday, October 18, 2012

Method of Computer Virus Detection. Sad story of a patent application

It was quite a long time ago (an epoch ago in software development terms), around the end of 2005 and the beginning of 2006. I was then working for Aladdin Knowledge Systems' eSafe unit as a computer virus researcher (my first formal RE job). Detection methods were quite poor at that time, even heuristic ones (not that they are THAT good these days). There was quite a lot of noise about the Morphine scrambler at that time, and I was responsible for solving that issue by developing a reliable detection method.

I have to admit, Morphine was quite an advanced scrambler at that time. A masterpiece, I should say. Standard methods, at least those used by eSafe at that time, did not work, and the engine required some changes.

As this was about the only task assigned to me during that period, I decided to play a bit more with Morphine while waiting for the aforementioned changes to be made.

It was so easy to identify Morphine's code by eye, but somehow I could not fit the pattern into any programmatic method (of those used at the time, as I said). Well, there are plenty of expert systems and neural networks that mimic the way decisions are made in our minds, and such systems existed at the time. However, I was hardly aware of them yet, and the few I had heard about looked quite complicated.

My decision was to try to build a simple system capable of recognizing logic patterns in the code. It is quite obvious that different implementations of the same algorithm share the same logic, which shows up at least as similar opcode sequences, although the overall binary representation may differ even if you merely replace one register with another. This led me to the simple system described below.

It is important to mention that all of the following information is publicly available, so I am not violating any NDA or anything of the kind.

Code Generalization
Our mind generalizes the disassembled code by extracting the relevant logic information. But how do you do that in software? The solution turned out to be easier than I initially expected. I simply had to sort the opcodes into categories, assigning a numeric value to each category. For example, let's take three categories: stack, bitwise and flow control operations. The following example shows two pieces of code that differ on a binary level but are completely identical logically:

Code #1              Code #2                 Generalized form
push  eax            push  edx               0x0001
xor   eax, eax       xor   edx, edx          0x0002
pop   ebx            pop   ecx               0x0001
ret                  jmp   dword[esp]        0x0003

As you can see, the two code snippets are logically identical but look quite different if you compare them in compiled form. However, if you generalize the snippets, you get the same result from both.

This is really a basic explanation of the system. Besides, it has evolved since that time.
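To make the idea concrete, here is a minimal sketch of the generalization step in Python. It is only an illustration, not the original implementation: the category table, the UNKNOWN fallback and the generalize function are hypothetical names, and it works on textual mnemonics for readability, whereas a real engine would decode opcodes straight from the binary.

```python
from typing import List

# Category codes taken from the table above.
STACK, BITWISE, FLOW = 0x0001, 0x0002, 0x0003
UNKNOWN = 0x0000  # assumption: fallback for mnemonics outside this tiny table

# Hypothetical, deliberately tiny mnemonic-to-category table.
CATEGORY_BY_MNEMONIC = {
    "push": STACK, "pop": STACK,
    "xor": BITWISE, "and": BITWISE, "or": BITWISE, "not": BITWISE,
    "ret": FLOW, "jmp": FLOW, "call": FLOW,
}


def generalize(listing: str) -> List[int]:
    """Turn a disassembly listing into a sequence of category codes.

    Operands (registers, immediates) are ignored on purpose: only the
    kind of operation survives the generalization.
    """
    codes = []
    for line in listing.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic = parts[0].lower()
        codes.append(CATEGORY_BY_MNEMONIC.get(mnemonic, UNKNOWN))
    return codes


code_1 = """
push  eax
xor   eax, eax
pop   ebx
ret
"""

code_2 = """
push  edx
xor   edx, edx
pop   ecx
jmp   dword[esp]
"""

print([hex(c) for c in generalize(code_1)])
print(generalize(code_1) == generalize(code_2))
```

Both snippets collapse to the same sequence of categories, which is exactly the property the signatures rely on.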

Automatic Signature Generation
The most pleasant thing about this system was its ability to extract signatures automatically. At that time, only two samples of the same malware were needed, right now - one is more than enough. However, let me concentrate on the method as it was initially presented.

As I mentioned, two samples of the same malware were needed. Their executable content was generalized using the system described above into a pair of arrays of extracted categories. The arrays were then compared with each other, and all similarities were put into a separate list of potential signatures. Why potential? Just because at that stage any of them could be a signature of "legal" logic which might be found in any executable (e.g. a library routine).

In order to eliminate such "false" signatures, the list was applied to a set of "clean" files and each potential signature found in any clean file was removed.
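The two steps above can be sketched roughly as follows. Note that this is a simplified illustration under my own assumptions: the actual comparison algorithm is not described in this post, so the sketch simply treats every run of n consecutive category codes as a candidate, and the names ngrams, candidate_signatures and filter_clean are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

Signature = Tuple[int, ...]


def ngrams(codes: Sequence[int], n: int) -> Set[Signature]:
    """All runs of n consecutive category codes found in one file."""
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}


def candidate_signatures(sample_a: Sequence[int],
                         sample_b: Sequence[int],
                         n: int) -> Set[Signature]:
    """Runs shared by both generalized samples: the potential signatures."""
    return ngrams(sample_a, n) & ngrams(sample_b, n)


def filter_clean(candidates: Set[Signature],
                 clean_files: Iterable[Sequence[int]],
                 n: int) -> Set[Signature]:
    """Drop every candidate that also shows up in a known clean file."""
    seen_in_clean: Set[Signature] = set()
    for codes in clean_files:
        seen_in_clean |= ngrams(codes, n)
    return candidates - seen_in_clean


# Made-up generalized sequences, for illustration only.
sample_a = [1, 2, 1, 3, 2, 2]
sample_b = [3, 1, 2, 1, 3, 1]
clean_files = [[1, 1, 2, 2, 3]]

candidates = candidate_signatures(sample_a, sample_b, n=4)
signatures = filter_clean(candidates, clean_files, n=4)
print(signatures)
```

In this toy run the only run of four codes shared by both samples is (1, 2, 1, 3), and it survives the filter because the clean file does not contain it. A realistic signature would of course be much longer than four codes; otherwise almost everything would also match some clean file.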

The very first test results showed that Morphine, a masterpiece of polymorphism, could be recognized with a single logic signature (and I tested it on thousands of files scrambled with Morphine). Needless to say, the method was just as effective for at least 95% of the malware known at that time. Basically, that meant that a database of several tens of megabytes could be replaced with a list of several kilobytes.

What's sad about this?
My employer at that time, Aladdin Knowledge Systems, applied for a patent. Several years later (I was already working for another firm) I learned that the application had been denied by the USPTO. The reason was quite surprising... As I had a chance to read the correspondence between the patent attorney and the examiner, I discovered that the application was denied based on the comparison algorithms used to compare the sequences of categories, which had TOTALLY NOTHING to do with the idea itself, which was about the preprocessing of data (extraction of logic patterns)... Somehow, this method (despite all the excitement) never got implemented in the product either...

For those interested, the application may be found here.

