If we can handle a target as complex as PE files, we still face a sad fact: we can infect files on the Intel platform, but we can never get outside it. A rare exception to this rule is the virus Esperanto (by Mr. Sandman, published in 29A Nr. 2), the first of its kind, capable of spreading across various platforms and processors. Glory goes to Mr. Sandman, but unfortunately this approach cannot be used for larger projects. Esperanto's whole solution is based on the presence of two parts, one for Intel processors and the other for Macs, which practically doubles the size of the necessary code. That doesn't seem to be the ideal solution: imagine 50 kB of viral code for three processors and we land somewhere around a 150 kB maxivirus.
I would solve this problem with another approach, one that is more difficult (but not impossible) to code. I state here that I am not ready to participate in such a project (no time and morale left); I would like to find some newbies or people ready to work hard. The idea is quite simple: we should carry the body in some kind of pre-compiled state that can easily be translated into the assembly language of every single target processor.
Imagine we have a C compiler that produces output at a level between C and assembly language. "Between C and assembly" means that before the code is assembled, it has to be compiled by a special C compiler. In fact the code should be at the lowest level it can be, because we need to assemble it for various architectures. For the same reason it should be independent of registers and memory addressing modes. The model I like best is a stack machine (using RPN, reverse Polish notation) with a direct memory addressing mode (only the value on top of the stack is a memory address). Of course, this means the compiled "code" will be larger than regular Intel code.
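For readers who have not met the stack-machine model, here is a minimal sketch of the notation only: a toy evaluator for reverse Polish code in which an address is always taken from the top of the stack. The instruction names (`PUSH`, `LOAD`, `STORE`, `ADD`) and the tuple encoding are assumptions made for this illustration; the text does not define an instruction set.

```python
def run(program, memory):
    """Evaluate a list of (opcode, *operands) tuples on a simple stack machine."""
    stack = []
    for op, *args in program:
        if op == "PUSH":        # push a literal (a value or an address)
            stack.append(args[0])
        elif op == "LOAD":      # pop an address, push the value stored there
            addr = stack.pop()
            stack.append(memory[addr])
        elif op == "STORE":     # pop an address, then pop a value and store it
            addr = stack.pop()
            value = stack.pop()
            memory[addr] = value
        elif op == "ADD":       # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# c = a + b, with a at address 0, b at address 1 and c at address 2
PROGRAM = [
    ("PUSH", 0), ("LOAD",),   # push a
    ("PUSH", 1), ("LOAD",),   # push b
    ("ADD",),                 # a + b
    ("PUSH", 2), ("STORE",),  # store the sum into c
]

memory = [3, 4, 0]
leftover = run(PROGRAM, memory)
```

Note that the statement `c = a + b` takes seven stack instructions here, which is the size penalty mentioned above: no registers or addressing modes are assumed, so everything goes through the stack.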
The resulting code for a given processor could gain quite high variability this way (every single translation could use different instructions or registers). Also, if the resulting code is close enough to code produced by C compilers (a standard stack frame, analogous use of the stack, registers, variables and so on), it would be very hard for heuristics to tell apart without further analysis. And it would be even harder (if not impossible) to distinguish between variants. This would make complicated (and unemulatable) polymorphic routines, decryptors and the like redundant. The only condition for not being a simple target is to keep the "source" (which is by its nature more or less static) encoded and to decode it only when it is needed for replication.
Of course, the precompiled "source code" has to contain an "assembler" for every supported processor. The assembler, as the heart of the body, gives a virus its variability and complexity, so detection is only as hard as the assembler is good. That is the reason why such viruses can be very long: 5 kB, as for a classic poly routine, will not be enough. That is also the reason why such viruses (probably) would not be spread by mail. Apart from this, the code will be very similar to that of standard languages. You need not deal with file infection in general; you can link your data area wherever you need, so you do not need writeable sections for code, which is in my opinion the strongest heuristic flag.
Real time compiling
Another possibility is to compile the code at run-time: you need not have the whole code compiled in the host file; you can compile it at the time you need it. This may at least reduce the amount by which the file grows. I am not sure whether this is safe enough to stay invisible, but I think compilation is complex enough to slow emulation down, and maybe it makes scanning speed unacceptable, so AVers will have to find new ways of detecting.
Another advantage is the BIG scope for modifications to the pre-compiled code, because you know exactly what your code means and what kinds of modification can be performed on it. Because each new one inherits its code from its parent, in 10 generations there can be a very big difference between existing variants. Just imagine block permutations (modules or just functions) and minor changes in code like c=a+b -> c=b+a. I think that is good enough to totally change the look of viruses from parent to child, not to mention the differences between distant variants. A bit more complex changes are possible too; of course it depends on the source language and on you.
Disadvantages - size
As I see it, the main disadvantage is size. Because the technologies needed are a bit difficult to implement, I don't even hope that the resulting code will be smaller than 50 kB, which is IMHO a bit of a problem these days. First of all, you can't use a mailing strategy to spread. It takes some time to download 150 kB of mail :-(. I have heard that 300 kB is nothing, and media with 100 MB/s throughput really are coming, but the main limiting factor is the floppy disk/internet, and we still live in a world where 3 kB/s is a high speed (33k6 modems are quite usual for internet use from home).
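To put rough numbers on the size argument, the arithmetic can be written out as follows. This is only a back-of-the-envelope sketch: it uses the figures quoted in the text (150 kB and 300 kB bodies, 3 kB/s) and assumes an ideal link with no protocol overhead; the function name is made up for the illustration.

```python
def transfer_seconds(size_kb, rate_kb_per_s):
    """Ideal transfer time in seconds, ignoring protocol overhead and retries."""
    return size_kb / rate_kb_per_s


# Figures quoted in the text: 150 kB and 300 kB at a "high speed" of 3 kB/s
t_150 = transfer_seconds(150, 3)
t_300 = transfer_seconds(300, 3)
print(f"150 kB at 3 kB/s: {t_150:.0f} s")
print(f"300 kB at 3 kB/s: {t_300:.0f} s")
```

Under these assumptions a single 150 kB message ties up the line for close to a minute, which is the delay the text complains about.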
There can be some problems at the interface level (the level where host file and virus are directly connected). We are not far enough along to say whether it can be handled wholly by the compiler or whether it needs special handling with PE- and platform-dependent code. But it should not be a big problem.
And now some sci-fi:
Probably the first reason we started with all this stuff was to try how genetics would work in VX. This approach gives you much better control over code modularisation and code generation. Our first idea was to create viruses able to exchange modules with one another in order to optimize themselves and adapt to the current environment. This gives a much better probability of survival, but it needs an environment with a strong exchange of genes, which is difficult to create. And now to the real world ...
And now some closing words. The main advantage of pre-compiled code is the possibility of cross-platform infection. Besides this, the approach opens other horizons, at least at the level of today's poly engines, and in the eternal 'game of hiding the body' it goes more in the direction of giving the virus body the 'right color' than of building 'bullet-proof' walls of anti-code. This in no way leads to lower variability of the code. With these features, the concept leads to viruses which are TMC-like.
Another plus is that programming in an HLL is more comfortable and faster, read: more effective, not to mention the base address independence :-).
Think about it!