Every asm coder wants to produce fast, short and efficient code. In assembly, things should always be pushed to the limit. Imagine: 58 bytes are enough to display a color PCX file on the screen, 2 bytes can restart the system, and so on.
Virus coders especially should produce tight and effective code; unoptimized virus code is not a pleasant thing to see. Sometimes optimization can save a couple of hundred bytes, a good reason to deal with this interesting topic.

0. What is optimization
Optimization is the process of making something more efficient, more reliable, faster, better or less buggy. In programming this means faster execution of the program, a reduction of its size and so on. Virus coding is mostly programming in assembly, which already produces fast and short code. But everything can be done better, and even assembly code can be made shorter. Some tips are included in this tutorial.

A. Opening words
There are some things you should really avoid in your programs, no matter whether it is a virus or some utility. First of all, remove all unnecessary code: a good program should contain no NOPs. Another good tip is to examine the jumps in the program and find out whether some JMP NEAR can be replaced by JMP SHORT. A good test is to remove the JUMPS directive at the beginning of the source code. Forward references are bad for optimization; in some cases they produce unnecessary NOPs.

We get a better result with multiple passes (the /m switch).
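
As a rough sketch of the effect (the label skip is made up and stands right after the jump, and the exact bytes depend on the assembler and its version): in a single pass the assembler reserves room for a near jump on a forward reference, later finds that a short jump fits, and pads the spare byte with a NOP.

      EB 01 90                 jmp     skip     ; one pass: 3 bytes, the last one is a NOP
      EB 00                    jmp     skip     ; multiple passes: 2 bytes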

A very bad idea is to calculate at run time some weird value that could be calculated directly in the source code using operators and parentheses.
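
For example, assuming two hypothetical labels buf_start and buf_end in the same segment, the length of a buffer in words can be left to the assembler instead of being computed by the program:

              mov     ax, offset buf_end              ; 3 bytes
              sub     ax, offset buf_start            ; 3 bytes
              shr     ax, 1                           ; 2 bytes

              mov     ax, (buf_end - buf_start) / 2   ; 3 bytes, same result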

B. Uninitialized data
Viruses often need some memory, e.g. for loading the original EXE header. This could be done with the following code (similar code can often be seen in viruses)

Well, as you can see, if you place uninitialized data somewhere inside the body of your virus, the resulting code is larger than necessary. Therefore avoid placing uninitialized variables, structures or buffers in the code. All of these should be placed on the heap. For an example of how it should not look, just take a look at your COMMAND.COM: in the one from DOS 6.22 as well as the one from Windows 98 you will see a lot of zeroes near the EOF. That code is especially poorly optimized. Try to use something like

C. Register settings in various situations
There are some interesting situations, like executing an EXE or COM file or booting the computer. As one can expect, the register settings are not random but very deterministic. Virus authors can take advantage of these defaults by testing some registers for their "on run" values. Tables 1 and 2 show the values that are set for us.

Table 1 - COM and EXE files

  reg   default value       reg   default value
  AX    0000h (usually)     BP    091Ch
  BX    0000h               SP    SP
  CX    00FFh               DS    PSP
  DX    PSP                 ES    PSP
  SI    IP                  SS    SS
  DI    SP                  CS    CS (PSP in COM)

Table 2 - At the boot time*

  reg   default value       reg   default value
  AX    0201h               BP    7C00h
  BX    7C00h               SP    7BFCh
  CX    0001h               DS    0
  DX    0080h               ES    0
  SI    0005h               SS    0
  DI    7C00h               CS    0

* According to Ender's tests, this might vary from BIOS to BIOS; however, it remains constant for one machine (or BIOS version). You can use it as a sort of snapshot taken at first boot time and then used, for example, in a decryptor. An antivirus cannot guess these numbers without a reboot, so it will not be able to correctly decrypt a poly virus in the MBR. (Quite a nice idea of Ender's, isn't it?)

As we all know, Turbo Debugger sets all general registers to 0. When we perform some operation like

This way a virus can easily escape from a debugger. But against a well-written emulation system we have no chance, because they (AV) know that we know, and they are ready. Still, we can easily stop all the wannabes trying to "research" our virus and give them the fun they want to have.

D. Putting 0 in register

  1. Clearing whatever register
    The usual way of clearing a register is
      B8 0000                  mov     ax, 0    ; costs 3 bytes
      
      33 C0                    xor     ax, ax   ; is more optimised
      2B C0                    sub     ax, ax   ; only 2 bytes !

    The approach above can be used whenever you need it. Under Win32 the optimization pays off even more; just take a look at the following piece of code:

      B8 00000000              mov eax, 0       ; costs 5 bytes
      33 C0                    xor eax, eax
      2B C0                    sub eax, eax     ; costs only 2 bytes
    Now we will take a sneak peek at the specific situations.

  2. Clearing AH
    Let's assume we need to clear the AH register. Generally we would do it this way:
      B4 00                    mov     ah, 0    ; cost 2 bytes in DOS
      2A E4                    sub     ah, ah
      32 E4                    xor     ah, ah   ; is the same
    but if the AL value is less than 80h we can save a byte (DOS)
      98                        cbw
    while under Windoze the cbw will be assembled as
      66 98                     cbw
  3. Clearing DX
    We use the same approach as with AH. If the value in AX is below 8000h we can use
      99                        cwd
    to clear the DX
  4. When is CX clear?
    It is absolutely useless to use code like
              ....
              loop some_label
              xor     cx, cx          ; or even worse - mov cx, 0

    because after leaving a loop (I do not mean a hard or conditional jump out of the loop) the CX register is already cleared, just as it is after the REP ??? operations.

E. Test if register is clear
  1. General situation
    A normal, uneducated coder will use this code:
      3D 0000                  cmp     ax, 0    ; which takes 3 bytes
      83 FB 00                 cmp     bx, 0
    but a hard-core programmer will surely use
      0B C0                    or      ax, ax   ; this takes only 2 bytes
      0B DB                    or      bx, bx
    if the value can be discarded we can save another byte
      48                        dec     ax
      4B                        dec     bx       ; this is only 1 byte

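    which sets the sign flag for us if the register was zero (provided the value is known not to exceed 8000h, because larger values leave the sign flag set too). The situation under Windoze is similar: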

      83 F8 00                 cmp     eax, 0
      0B C0                    or      eax, eax
      66| 0B C0                or      ax, ax   ; this is not default
                                                ; operand size
  2. The CX register
    On x86 processors the CX register is used as a counter, and the instruction set supports this use.
      E3 ??                    jcxz     some_label

    jumps when CX is zero, so we do not need to compare CX with zero. But demo coders especially should pay attention to the fact that this instruction is relatively slow.

F. Putting 0FFh in AH, 0FFFFh in DX and CX
If AL >= 80h, CBW sets AH to 0FFh. If AX >= 8000h, CWD puts 0FFFFh in DX. As for CX, this register is 0 after leaving a loop, so a single DEC CX then gives 0FFFFh.
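
A short sketch of the sizes involved, in 16-bit code. Each short form only works under the precondition noted in its comment:

      B4 FF                    mov     ah, 0FFh ; 2 bytes
      98                       cbw              ; 1 byte, needs AL >= 80h
      BA FFFF                  mov     dx, 0FFFFh ; 3 bytes
      99                       cwd              ; 1 byte, needs AX >= 8000h
      B9 FFFF                  mov     cx, 0FFFFh ; 3 bytes
      49                       dec     cx       ; 1 byte, needs CX = 0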

G. Test if registers are 0FFFFh
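If we need to test whether two registers both hold 0FFFFh, and the value of one of them is not needed afterwards, we can easily do it by ANDing the two registers together and incrementing the result, instead of comparing each register with 0FFFFh separately.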
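
One way to write this test, as a sketch with made-up label names: the AND yields 0FFFFh only when both registers hold 0FFFFh, and the INC then turns that into zero.

      23 C3                    and     ax, bx   ; 0FFFFh only if both are 0FFFFh
      40                       inc     ax       ; 0FFFFh + 1 = 0
      74 ??                    jz      both_ffff ; 5 bytes in total, AX destroyed

compared with the straightforward version:

      3D FFFF                  cmp     ax, 0FFFFh
      75 ??                    jne     not_both
      83 FB FF                 cmp     bx, 0FFFFh
      75 ??                    jne     not_both ; 10 bytes in total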

H. Using EAX/AX/AL saves bytes
Opcodes for some types of instructions are shorter when you use AX or AL instead of any other register. This is the case, for instance, for arithmetic with an immediate operand and for MOV from a direct memory address.
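
A few illustrative pairs in 16-bit code (the constants and the address 1234h are arbitrary):

      3D 1234                  cmp     ax, 1234h        ; 3 bytes
      81 FB 1234               cmp     bx, 1234h        ; 4 bytes
      04 05                    add     al, 5            ; 2 bytes
      80 C3 05                 add     bl, 5            ; 3 bytes
      A8 01                    test    al, 1            ; 2 bytes
      F6 C3 01                 test    bl, 1            ; 3 bytes
      A1 1234                  mov     ax, ds:[1234h]   ; 3 bytes
      8B 1E 1234               mov     bx, ds:[1234h]   ; 4 bytes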

I. MOV vs. XCHG
Moving a value from one register to another can be replaced with XCHG, but only when the value left in the source register is not important. As expected, the AX register is the better choice here too.

Between registers other than AX, XCHG is the same size as MOV, and in some cases it can even enlarge the code.
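
A sketch of the three cases in 16-bit code. The memory example is only one illustration of XCHG coming out larger, the address 1234h is arbitrary, and an assembler may pick an equivalent encoding for the register forms:

      8B C3                    mov     ax, bx           ; 2 bytes
      93                       xchg    ax, bx           ; 1 byte, BX gets the old AX
      8B D9                    mov     bx, cx           ; 2 bytes
      87 D9                    xchg    bx, cx           ; 2 bytes, nothing gained
      A1 1234                  mov     ax, ds:[1234h]   ; 3 bytes
      87 06 1234               xchg    ax, ds:[1234h]   ; 4 bytes, larger than MOV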

J. 16 bit and 8 bit registers
I doubt there is any non-newbie coder who would load AH and AL with two separate 8-bit MOVs, which takes 4 bytes, when loading the whole of AX at once takes only 3 bytes.
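
With arbitrary values, the two variants look like this:

      B4 12                    mov     ah, 12h
      B0 34                    mov     al, 34h          ; 4 bytes together
      B8 1234                  mov     ax, 1234h        ; 3 bytes, same result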

K. Registers and immediate operands
Immediate operands cost more bytes than registers do.
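
As a sketch, assuming AX already holds 0 and DX already holds 1234h from earlier code, reusing the registers is shorter than repeating the constants:

      C7 04 0000               mov     word ptr [si], 0 ; 4 bytes
      89 04                    mov     [si], ax         ; 2 bytes, AX is already 0
      81 FB 1234               cmp     bx, 1234h        ; 4 bytes
      3B DA                    cmp     bx, dx           ; 2 bytes, DX already holds 1234h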

L. Segment registers playground

  1. Avoid using non-default segments
    If you use a non-default segment register for an operation, a segment prefix will be generated, which adds 1 byte to the size of the code.
  2. Moving from segment register to segment register
    We can't move a value directly from one segment register to another. It has to be coded ala
      8C C8                    mov     ax, cs
      8E D8                    mov     ds, ax
    with length of 4 bytes but
      0E                       push    cs
      1F                       pop     ds
    takes only 2 bytes

M. The string instructions
Intel prepared instructions like LODS, STOS, MOVS, SCAS and CMPS for handling large amounts of data. One has to know what these instructions do and which longer sequences they replace.
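
A sketch of the byte-sized variants next to their plain equivalents in 16-bit code. The string forms also advance SI and/or DI according to the direction flag, which the plain forms do not:

      8A 04                    mov     al, [si]         ; 2 bytes
      AC                       lodsb                    ; 1 byte

      26| 88 05                mov     es:[di], al      ; 3 bytes
      AA                       stosb                    ; 1 byte

      26| 3A 05                cmp     al, es:[di]      ; 3 bytes
      AE                       scasb                    ; 1 byte

      A4                       movsb                    ; like lodsb + stosb, AL untouched
      A6                       cmpsb                    ; like lodsb + scasb, AL untouched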

A LODS-type instruction saves 1 byte compared with MOV.

A STOS-type instruction saves 2 bytes compared with MOV. MOVS therefore saves at least 3 bytes (it does the same as LODS followed by STOS).

A SCAS-type instruction saves 2 bytes compared with CMP. CMPS does the same as LODS followed by SCAS, thus we save at least 3 bytes with CMPS.

We must not forget the REP prefixes: with this 1-byte prefix we can do miracles at almost no cost. In addition, CX is set to 0 after REP ??? (REPE/REPNE can stop earlier and leave CX non-zero).

N. DEC/INC vs SUB/ADD plus SI/DI
Incrementing or decrementing a 16-bit register takes only 1 byte. It is more efficient than INC/DEC on an 8-bit register, or adding or subtracting 1 from some address or register.

Even if we need to add or subtract 2, using INC/DEC twice is 1 byte shorter. For SI/DI, if AX doesn't matter, we can use one of the string instructions in place of INC/DEC (depending on the direction flag).
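
The sizes in 16-bit code, as a sketch. Note that INC/DEC leave the carry flag alone while ADD/SUB change it, and that the string instructions read memory at the pointer they advance:

      43                       inc     bx               ; 1 byte
      FE C3                    inc     bl               ; 2 bytes
      83 C3 01                 add     bx, 1            ; 3 bytes
      43                       inc     bx
      43                       inc     bx               ; 2 bytes together
      83 C3 02                 add     bx, 2            ; 3 bytes
      46                       inc     si
      46                       inc     si               ; 2 bytes together
      AD                       lodsw                    ; 1 byte: SI + 2 if DF = 0, AX destroyed
      AF                       scasw                    ; 1 byte: DI + 2 if DF = 0, flags changed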

O. SHR/SHL
SHR and SHL can be used to divide or multiply by 2, 4, 8 and so on instead of DIV and MUL. With DIV/MUL we first have to load a register with the operand, so to multiply AL by 4 we can replace a MOV followed by MUL with a single shift by 2 and save a byte. If we are multiplying by just 2, we save 2 bytes, because a shift by 1 takes only 2 bytes.
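
A sketch of the three variants. The results differ in one respect: MUL leaves the full 16-bit product in AX, while the shifts keep only the low 8 bits in AL.

      B3 04                    mov     bl, 4
      F6 E3                    mul     bl               ; AX = AL * 4, 4 bytes together
      C0 E0 02                 shl     al, 2            ; 3 bytes (80186 and later)
      D0 E0                    shl     al, 1            ; 2 bytes, AL * 2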

P. Procedures
If some code is used over and over again, it is very clever to put such a piece of code in a procedure. It could save some valuable bytes, but in other cases it could also add some bytes to the code. How can you decide whether putting the code in a procedure saves bytes? Let N be the number of times the piece of code is used, S its size, and B the number of bytes saved (or lost, if negative). When the code is not part of a procedure, its total size will be N*S. But when we put the code in a procedure, the resulting size will be

      (S + 1) + N*3

that is, the size of the code plus 1 byte for the RET, plus 3 bytes for each of the N CALLs. The difference between the two sizes is B:

      N*S = (S + 1) + N*3 + B

And the resulting equation is

      B = N*S - (S + 1) - N*3 = N*(S - 3) - (S + 1)

From the formula above you can clearly see that repeated code of 3 bytes or less cannot be replaced by a procedure in the process of optimization. When we know the size of the code (which should be at least 4 bytes) and the number of repetitions, we can estimate the number of saved bytes. So, for code 4 bytes long we save 1 byte if the code is repeated 6 times and put in a procedure (B = 6*1 - 5 = 1).
The next thing you can do with procedures is to use multiple entry points for the same procedure.
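
A minimal sketch of two entry points sharing one body. The labels are made up, and BIOS teletype output (INT 10h, AH = 0Eh, character in AL, page in BH) is used only as a convenient example:

      print_crlf:                             ; entry 1: print CR, then LF
              mov     al, 0Dh
              call    print_char
              mov     al, 0Ah                 ; falls through into entry 2
      print_char:                             ; entry 2: print the character in AL
              mov     ah, 0Eh
              int     10h
              ret

The first entry reuses the tail of the procedure, so the second routine costs no RET and no extra CALL of its own.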

Q. Multiple pops/pushes
Sometimes a situation arises where you need to repeat a lot of pushes or pops a couple of times. If your thinking goes in the direction of "procedure", it goes the right way. But there is one little problem with the push/pop instructions: they manipulate the stack pointer, just as CALL does. The instruction set provides a solution, though: the JMP register instruction. The procedure pops its return address into a register, does its pushes or pops, and returns with a jump through that register.
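
A sketch of such a pair of helpers, with made-up labels. It assumes that BP is free to hold the return address, and it pays off only when the list of registers is long enough to outweigh the 3-byte CALL and the helper itself:

                       push_regs:
      5D                       pop     bp       ; return address into BP
      50                       push    ax
      53                       push    bx
      51                       push    cx
      52                       push    dx
      FF E5                    jmp     bp       ; return without touching the stack
                       pop_regs:
      5D                       pop     bp       ; return address into BP
      5A                       pop     dx
      59                       pop     cx
      5B                       pop     bx
      58                       pop     ax
      FF E5                    jmp     bp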

R. Object code
Some opcodes are not accessible in some memory models, so we have to use a workaround. The most typical use of object code in a virus is the famous return to the original DOS handler:

Another nice use of object code is, let's say, in a decryptor ala: but as every copy of the virus will have a different key, we code this instruction as which will work perfectly.

S. Structure of code
This is what optimization is really about. If you structure your code well you can save a lot of bytes. If you are using a lot of procedures, do not forget to check the input and output registers. Sometimes it can be valuable to use a register that costs fewer bytes to handle in procedures. As an example of well-structured code, look at part of an INT 21h handler for a hypothetical virus. Pay attention to the use of the call instruction here:

The size of the code above is 26 bytes. Every additional handled function adds just 3 more bytes. But when we use the typical virus sequence of CMP, JE/JNE, JMP, it will look like:

thus we have to count at least 8 bytes for every single handled function. If we want to handle 10 functions, that is as much as 80 bytes. Structured coding from the previous example fits the same functionality in only (26+3*8) = 50 bytes. 30 bytes saved; do I have to explain further all the pluses structured coding can bring you?

T. Arithmetics with SIB (LEA)
There is another way on the i386+ to simplify more complicated arithmetic operations. The SIB byte of an instruction (used for complex memory access) can be used for arithmetic as well, through the LEA instruction. SIB means Scale, Index, Base, which is the principle of memory access in its general form: [base + index*scale + displacement].

The index is a 32-bit register (any except ESP) multiplied by a scale of 1, 2, 4 or 8; the base is another 32-bit register; and finally the displacement is a raw offset. The displacement can be a 32-bit value, which takes another 4 bytes, but the address can also be formed without this constant.
You can use it for arithmetic operations; moreover, it is often faster than comparable equivalents using multiplication or shifts, and you can even multiply by non-standard values, like EAX times 9.
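
A few illustrative forms, assuming a 32-bit code segment. LEA only computes the address, so no memory is accessed and the flags are left unchanged:

      8D 04 C0                 lea     eax, [eax+eax*8]         ; EAX = EAX * 9
      8D 04 80                 lea     eax, [eax+eax*4]         ; EAX = EAX * 5
      8D 44 8B 10              lea     eax, [ebx+ecx*4+10h]     ; EAX = EBX + ECX*4 + 16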

As you can see, this LEA trick can be used in many circumstances, and I will not list them all here, of course.

I hope this little introduction gives you some ideas, but as you surely know, "the truth is out there": you need to optimize your code in a more general context, not only at the instruction level but in register and stack usage as well, and there are many other things that can't be described in such a general way. Maybe that is for another article. But for now, go ahead, and may your code never be pessimized...