I’ve been working on implementing that FProt paper I talked about in my last post. The basic idea is to scan the binary for known code chunks, and instead of generic emulation, implementing special handlers (or mini-emulators as the FProt paper describes them).
I implemented the Multi-Pattern Matching algorithm according to a paper by Wu and Manber http://webglimpse.net/pubs/TR94-17.pdf. Snort once used a variation of this algorithm known as “mwm” – modified Wu Manber. My implementation of the paper differs slightly from the original paper, and although I won’t outright say some details in the paper are wrong, I had to use my own interpretation of the algorithm for it all to work.
The algorithm is similar to the Boyer-Moore string matching algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm. The Wu-Manber algorithm uses a hash of the last characters of the buffer to match to give a list of possible matches. The bad character shift will be a minimum of shifts for each possible match given by the hash. If that’s confusing, then look at how the Boyer Moore algorithm works and it will make more sense.
The results of the pattern matching were quite good, with a very minor performance impact, which allowed me to continue implementing the rest of the FProt paper without too many worries about worst case performance if no known code chunks are found.
From an instruction trace of my sample packed binaries, I could fairly easily see where most of the time in execution was being spent. Basically I looked for code that was writing to memory where the decompressed binary would end up in. These would loop and be part of the decompression functions. I then looked for entry points (the code leading up to entering the loop) into that code and copied the complete decompression code into a flat text file.
I think the process of automatic extraction of decompression loops and checksum functions can potentially be automated. Certainly its fairly easy to see the memory access of a decompression loop by looking at the range of memory writes that occur from that location. Complete automation is something I will look into the future. It would be really nice to see automatically generated static unpackers, and I really think such a thing is possible.
I then decompiled the asm decompression code, and ported it to my emulator. This results in something resembling a static unpacker or mini-emulator. An interesting thing to note, is that many packers share similar code to handle decompression. aplib http://www.ibsensoftware.com/products_aPLib.html and minor variations of it are used in a number of packers I have. Infact, aplib is used in mew/fsg/rlpack/packman and probably others. It would have been nice to rip the code from the aplib sources, however, it’s not public domain so I decided to reverse it from the assembly in the packed binary. This is also advantageous because I couldn’t get perfect matches for all the packed samples I had against the aplib versions I had.
Alot of debugging followed, and for the one binary I was targeting (rlpack), I had to include another chunk of code for calculating a checksum of the libraries to get a good performance increase. But in the end, it all worked out quite well.
The result for rlpack was a performance increase of over 50% over complete generic emulation. Unpacking rlpack-calc.exe took over .52 seconds using generic emulation, and with the mini-emulator code, it took about .25 seconds. And remember that in the worst case and no code is identified, generic emulation occurs as usual.
There is still lots to do and many mini-emulators to write. The question if dynamic binary translation is a completely superior approach is quite open to me. I think using known code chunks and mini-emulators is faster, but at the price of invested development time.