In this article I would like to introduce my own view to scanning technologies for macros used in most common antiviruses.
Structure of WB6/VBA
The first thing you have to understand is how macro is stored in file. This is probably the most important reason why is scanning doing the way it is done. So let start. I will assume you know something about VBA (visual basic for applications), what is the most common language used for macros. I would like to focus to VBA, because it is more common than WB6 (word basic 6). VBA project is stored in its own folder. (you can easy take a look at it by using dfview - suplied with DevStudio). Each macro is stored in stream with a bit complex structure. Before macro is written to file it is compiled by built-in compiler to something like a stack-machine language. So a = b + c is something like this:
push b ; put b to stack push c ; put c to stack add ; pop b, c and put it's summary to stack pop a ; pop value and set a to it
Variables and function names are usually located in dictionary so it is a bit difficult to find them. In code are just pointers or indexes. Moreover dictionary is not in macro file, so if avir (and many do so) is lazy it just skips them.
Other important thing is that jumps and calls are not linked in file so it is a bit difficult to trace code flow.
Anoter aproach to scan VBA is to decode "source" suplied with macro. In this case avir have full source of code (with all names and so on), so it is enough to skip some headers or something more and CRC it.
WB6 is much simplier in structure. It's code is stored in tokens (?), where each token has exact meaning. For example space, +, -, SomeFunction, number, variable, and so on. So it is rather difficult to write emulator of this language. I don't like to waste time with this, so some example (the same case as I used for vba):
variable "a" operator '=' variable "b" operator '+' variable "c"
and some function:
internal function MacroCopy string "my_macro" operator ',' string "new_macro"
As you could easy find out this structure is very simple and that probably caused that most recent technologie is CRC.
Number of macro virii is growing day by day. This is a reason to use automatic systems to extract signatures. Because they were coming with WB6, and WB6 has very simple structure, it seemed that CRC is the best. It is "absolutely" exact - so it can recognise sub variants, it is fast and easy to implement. May be it seems silly, but it was in days when everybody trusted that polymorphism in macro is not possible (what a mistake :-). So scanning algorithm is something like this: CRC all macros and check in database. Look at macros you found and if they describes whole variant virii is identified exact. If something is missing and I think it is enough: possible virus. You can see even now the legacy of those old good days. Many variants are differ just in one tab or space after end of macro. CRC rulez... Suddenly polymorphic ones came to the light of world and CRC became not very efficient. Became not but it is very comfortable to just run some program on your collection and to have signatures. The simplies way you can see is just to modify CRC algorithm. In those days polymorphism was when name of variable changes or something like that, so why not to skip whitespaces or variable names. And this is probably the final solution. So called smart-crc. Because of structure of VBA you can afford to skip variable names (and it is even more comfortable - who will search for it in tables). The structure of code is stored in code. It is enough to check the structure. You can easily see that macro is doing something like ? = ? + ?. And this is enough. With this technique you can identify at least 90% of current virii. Finally if you want polymorphism you NEED to change the structure of code. For example swap lines or add garbage code. And automatic AV technologies will be smashed to do ground.
The next and very old way are scanstrings. It is easy to implement scanstring like scanner for macros. This may be exact because you can follow either structure of code or names. For example you may be looking for:
ToolsMacro .Name = something with .Edit
These are two very easy to implement but efficient in use methods.
Some word about heuristics
The main problem of macro virii is that there doesn't exist non virii macros (in fact they are very rare). So AV are not anxiety about how to detect, but how to distinguish between variants and how to find out macro is not virii :).
It is easy to follow functions macro is using, so write a macro virii heuristics is easy. There is not problem to say: "this may be macro virii", because you can easily find "ToolsMacro .Edit", "Organizer .Copy" and other shit macro must be dealing with. In fact it is not enough. There are many legal self-installing macros using macro copying functions. In virii you can find usually somthing special ... Very important flags for heuristics may be:
Because structure of VBA is a bit more complex it is hard to simulate it. In fact all heuristics i know about are passive (it means they are just searching for something) and i don't know about any big advantages in more complex analysis. In fact simulate whole macro is difficult task - try to write whole VBA... I have heard someone emulates variables, but i don't believe it. And i am almost sure there is no full emulator of vba. So something like this
a$="tomacro" b$="au" c$=b$+a$
is hardly ever suspicious for scanners. And I can't imagin that this can be suspicious:
.... a$="" b$="" 10 c$=b$+a$ if c$<>"" goto 20 a$="tomacro" b$="au" goto 10 20 ....
Of course this is silly example, but i like to point that without code following emulation it is useless to have emulation of variables. And you may use as many jumps as you want or even functions, cycles or something even more evil. If you want to make day of aver harder just assign variables way that they can't be simulated from top to down. It means use goto to turn back with new values and heuristics will have to be much more complicated.
So that is all i want to say ....