****************************************************************
****************************************************************
************ ***********
************ Chomsky Hierarchy and the Word Problem ***********
************ in Code Mutation ***********
************ by Second Part To Hell ***********
************ ***********
****************************************************************
****************************************************************
1) Introduction
2) Definitions
2.1) Formal Grammar and connection to code mutation
2.2) Word Problem and connection to virus detection
3) Chomsky Hierarchy
4) Chomsky Hierarchy in code mutation
4.1) Type-3 Grammar in code mutation and MetaPHOR
4.2) Type-2 Grammar as recursive mutation
4.3) Type-1 Grammar as macro function mutation
4.4) Type-0 Grammar as information annihilation
5) Formal Grammar of MetaPHOR
6) Conclusion
7) References
1) Inroduction
Code Mutation (whether it is obfuscation of a decryption engine or full
metamorphism) is a technique to prevent the detection of the computer
virus using analytic scan algorithm. Formal grammar gives us the basis
for a theoretical analysis of code mutation; therefore we must not
neglect or ignore it.
In 1999, Qozah has written an article called "POLYMORPHISM AND
GRAMMARS" [1], analysing polymorphism (exaclty: a garbage generator) for
the first time with this theoretical tool. Even his results for
non-regular grammars are not correct, his article opened the door for a
new way analysing virus technologies. In 2007, Eric Filiol wrote an
article called "Metamorphism, Formal grammars and Undecidable Code
Mutation" [2], dealing with the word problem and a description of a
extended MetaPHOR mutation engine using a "Tzeitzin semi-Thue system" (a
Type-0 Grammar).
In this article you will first read shourtly about the formal grammar
and word problem including the reason why this is important for virus
authors. (Chapter 2)
Then you will learn about the hierarchy of classes of formal grammars,
the Chomsky Hierarchy and about its connection to code mutation
including an analyse of the formal grammar of The Mental Driller's
MetaPHOR mutation engine. Especially for type-1 grammars, you will see
a surprising result. (Chapter 3-4)
Later you will see the full formal grammar of MetaPHOR;a formalisation
of all possible mutations rules. (Chapter 5)
Before reading this article, you should read and understand the idea
of formal language and grammar - for example read Qozah's article.
2) Definitions
2.1) Formal Grammar and connection to code mutation
Definition 1:
# A formal Grammar G is a 4-tuple G = (N, T, S, R) where:
# + N is a set of non-terminal symbols
# + T is an alphabeth of terminal symbols
# + S is an element of N; is the start symbol
# + R is the rewriting system
Definition 2:
# A word is a finite sequence of characters of an alphabeth T.
Definition 3.1:
# A universal language UL over an alphabeth T is a subset T* of all
# words over that alphabeth.
Definition 3.2:
# A formal language L is a subset of UL, contains all words generated by
# a formal grammar G := (N, T, S, R)
* Formal language in code mutation:
Contains all functions of the computer virus.
* Word in code mutation:
A word is a finite sequence of characters of the x86 instruction
set (any possible function generatable with the x86 instructions).
If the word is an element of L:
It has a specific genotype (behaviour), but no specific phenotype
(kernel code); it is a specific micro-function.
* Non-terminal Symbol in code mutation:
Function, which return (recursively) a specific terminal symbol
depending on its behaviour information. Can not be replaced by
rewriting system alone. ("a::= ..." does not exist; but "ab::=..."
does)
* Alphabeth of terminal symbols
x86 instructions set
* Rewriting system R in code mutation:
The rules of mutation; how the mutation engine can change specific
code or the decryption engine.
Standard Example:
G = (N, T, S, R)
N = {A, B}
T = {a, b}
R = {
S ::= bS | aA
A ::= bS | aB | a
B ::= bB | aB | a | b
}
Derivation:
S => bS => bbS => bbaA => bbaaB => bbaabB ==> bbaaba
Polymorphical Example:
G = (N, T, S, R)
N = {A, B}
T = {a, b, c, x, y}
a = "nop"
b = "sub edx, 0"
c = "push ebx"+"pop ebx"
x = "XOR [EDI], AL"
y = "INC AL"
e = ""
R = {
S::=aS | bS | cS | xA
A::=aA | bA | cA | yB
B::=aB | bB | cB | e
}
Derivation:
S => aS => acS => acxA => acxcA => acxcyB => acxcybB => acxcybe
nop
push ebx
pop ebx
XOR [EDI], AL
push ebx
pop ebx
INC AL
sub edx, 0
Metamorphical Example:
G = (N, T, S, R)
N = {A, B, C}
T = {a, b, c, d, e, f}
a = "mov reg, 0"
b = "pop reg"
c = "push 0"
d = "mov reg2, 0"+"push reg2"
e = "xor reg, reg"
f = "sub reg, reg"
R = {
S ::= A | Bb | C | D
A ::= a
B ::= c | d
C ::= e
D ::= f
}
Derivation:
S => Bb => db
mov reg2, 0
push reg2
pop reg
The polymorphical example have been already used in [1]. The metamorphic
example is very trivial, but even very complex viruses as MetaPHOR use
grammars like this (See Chapter 5).
2.2) Word Problem and connection to virus detection
The word problem can be formulate in mathematics with group theory, but
we will use the formulation of computability theory.
The problem (a decidion problem) is to construct an algorithm to
decide for a given word if it belongs to a formal language or not.
In connection with computer virus detection that means:
An anti virus program scans a file, and has to decide whether a file is
infected or not. Or: It has to decide wether a specific code of the file
is part of a viruscode.
OR: It has to decide wether a specific word from universal language UL
belongs to a formal language L generated by a formal grammar
G := (N, T, S, R), which represents a polymorphic or metamorphic engine.
Example:
G = ({A, B}, {a, b, c}, S, T)
T = {
S ::= aA | bA | c
A ::= aA | B
B ::= bB | c
}
Is "aaabbbc" part of L? Yes, because:
S => aA => aaA => aaaA => aaaB => aaabB => aaabbB => aaabbbB => aaabbbc
Is "aabbca" part of L? No, because:
L={ (a|b)(a)*(b)*c, c }, and the given word has no "c" at the end.
3) Chomsky Hierarchy
The Chomsky Hierarchy is a containment hierarchy of classes of formal
grammars.[3] All possible formal grammars are divided in Type-0 ...
Type-3 grammars, where Type-3 is subset of Type-2 (is subset of Type-1
(is subset of Type-0)).
Type 0: Recursively enumerable
Restriction:
no restrictions
Example:
S ::= aBBA
AB ::= BA | Sc | C
cBc ::= aaa | ab | c
A ::= CBC
B ::= cBca | cb | aB
C ::= a | c
Derivation:
S => aBBA => acBcacbCBC => aabacbcBc => aabacbccbc
Type 1: Context-sensitive
Restriction:
For all (w1 => w2): |w1|<=|w2|
Example:
S ::= aBBA
AB ::= BA | Sc | C
cBc ::= aaa
A ::= CBC
B ::= cBca | cb | aB
C ::= a | c
Derivation:
S => aBBA => acBcacbCBC => aaaaacbccbc
Type 2: Context-free
Restriction:
For all (w1 => w2): w1 is an element of non-terminal
symbols
Example:
S ::= aBBA
A ::= CBC
B ::= cBca | cb | aB
C ::= a | c
Derivation:
S => aBBA => acbcbCBC => acbcbacba
Type 3: Regular
Restriction:
R = { A = aB, A = Ba, A = a }
Example:
S ::= aS | aA
A ::= Bc
B ::= Bc | c
Derivation:
S => aS => aaA => aaBc => aaBcc => aaccc
4) Chomsky Hierarchy in code mutation
For an anti virus programm it is important that scanning a file does not
monopolize CPU-speed for long time. We can describe the time usage of an
algorithm with Landau notation O(n), where n is the amount of data or
number of mutation.
A simple sort algorithm has a O(n) performance: For a list of 20
entries it needs t=c*20; for a list of 2.000 entries it needs t=c*2.000.
A usual algorithm for multiplication of a N x N matrix with a N Vector
needs N(N-1)=N^2-N operations. The performance is O(n^2)
A usual algorithm for multiplication of a N x N matrix with a N x N
matrix needs N(N(N-1) operations. The performance is O(n^3).
About the complexity of an algorithm, which solves the word problem for
Chomsky Hierarchy:
Type-3: The complexity of an algorithm is linear: O(n)
Type-2: The complexity of an algorithm is at least cubical: O(n^3).
Type-1: The complexity of an algorithm is exponential: O(k^n)
Type-0: For Type-0 grammars the word problem is undecidable.
[ Caution:
Here we use an asymtotically description of the given problem:
lim n -> infinity
It's obvious that in practise this is not possible. Nevertheless a
virus can reach very high n, i. e. with deep (infinity) recursive
mutation. Virus Scanners have to detect all variants of a virus, so
that asymtotically description can be used.
]
4.1) Type-3 Grammar in code mutation and MetaPHOR
This is the lowest type of possible code mutation, but unfortunatly
used in most polymorphic or metamorphic viruses. In Chapter 5 you can
see that most parts of MetaPHOR's grammar is Type-3.
For example:
N02 ::= t003 | t004
N32 ::= t078 | M21N32 => N32 ::= t078 | t079N32
M43 ::= t125 | t126
A big disadvanage is, that the shrinker of MetaPHOR is the inverse of
expander (obfuscator). "The expander is the part that undoes what the
shrinker did. It performs *exactly* the reverse operation." [4]. This
result presents the determinism in the engine. Code mutation engines
have to be created in a way, that linear time complexity will be
prevented.
4.2) Type-2 Grammar as recursive mutation
When this type of grammar is used in a metamorphic virus, the
operations for deciding the word problem increases at least cubically.
As (rarly) done in MetaPHOR, creating a Type-2 grammar requires good
planning. The rules have to be recursively defined.
As example, the rewriting rule N24 has a recursion level > 5, maybe
some rules have even higher recursive level. Because of that, Mental
Driller has put a recursivity control for preventing the code to grow
too much.
Such infitive recursive rules are our goal, for increasing the
complexity of the word problem.
4.3) Type-1 Grammar as macro function mutation
Using this formal grammar in a virus, you can mutate macro functions,
not just the micro-functions aka. x86 instructions.
For example:
ABCD ::= EFGHI
A ::= "PUSH [hWnd]" | ...
B ::= "PUSH lpString" | ...
C ::= "PUSH 0xFF" | ...
D ::= "CALL GetWindowText | ...
E ::= "PUSH [hWnd]" | ...
F ::= "PUSH WM_GETTEXT" | ...
G ::= "PUSH 0xFF" | ...
H ::= "PUSH lpString" | ...
I ::= "CALL SendMessage" | ...
Beside of the extremly increased complexity of the virus, the word
problem will become not decideable in available time.
Combining this macro-function mutation with code integration to the
host file, the virus can use parts of the host code and parts or its
own code for mutation.
4.4) Type-0 Grammar as information annihilation
When you use shrinking-rules, a Type-0 Grammar may be created. Using
the inverse rules of the obfuscator, these rules have no effect. For a
valueable usage of deletion rules, the existence of big words is a
condition, otherwise the virus code may loose information.
5) Formal Grammar of MetaPHOR
G = (N, T, S, R)
N = { N01, ..., N47, M01, ..., M49 }
T = { code of MetaPHOR }
R = {
S ::= [ T, such that the hardcoded X86 instructions t000-t135
are replaced by the behaviour functions N01..M49 ]
N01 ::= t001 | t002 | M24N36
N02 ::= t003 | t004
N03 ::= t000 | t005 | N10 | t007 | t008 | t009 | t010 | t011
t028 | M03M06 | M06M03 | t090 | t091 | M34M35
N04 ::= t012 | t013
N05 ::= t014 | t015
N06 ::= t016 | N09 | t018 | t019 | t020
N07 ::= t021 | t022 | t023
N08 ::= t024 | t025 | t026 | t027
N09 ::= t029 | t030 | M01M02
N10 ::= t031 | t032
N11 ::= t033 | t034
N12 ::= t035 | t036
N13 ::= t037 | t038
N14 ::= t041 | M01M03
N15 ::= t043 | t033 | M05M04 | N16M09
N16 ::= t046 | M05M03
N17 ::= t047 | M06M02
N18 ::= t049 | M07M03
N19 ::= t053 | N14N20 | N16M12N17 | N14M31N17
N20 ::= t054 | N18M15
N21 ::= t055 | N09N12 | N15N10
N22 ::= t056 | N15M10
N23 ::= t058 | N10N12 | N12N10
N24 ::= t059 | N19M11 | M12M13
N25 ::= t063 | N21M10
N26 ::= t065 | M14N12
N27 ::= t066 | N18M49
N28 ::= t069 | N18M16
N29 ::= t071 | N16M17
N30 ::= t073 | N16M18 | M05N38
N31 ::= t077 | N16N18
N32 ::= t078 | M21N32
N33 ::= t080 | M22M24 | M34M36
N34 ::= t083 | N01M24 | N10
N35 ::= t085 | N02M25
N36 ::= t087 | M24N01 | N10
N37 ::= t088 | M25M26
N38 ::= t092 | M03M18
N39 ::= t093 | N06M27 | M28M29
N40 ::= t097 | N16M30N17
N41 ::= t105 | t106 | N14M37 | N16M40
N42 ::= t109 | N16M38
N43 ::= t112 | N16M39
N44 ::= t121 | N18M42
N45 ::= t124 | N18M46
N46 ::= t134 | M46M47M48
N47 ::= t135 | M48M47M46
M01 ::= t039 | N14M06
M02 ::= t040 | M03N17
M03 ::= t042 | M08N18
M04 ::= t044
M05 ::= t045 | N16M06
M06 ::= t048
M07 ::= t050
M08 ::= t051
M09 ::= t052
M10 ::= t057
M11 ::= t060
M12 ::= t061 | N18M33N18
M13 ::= t062
M14 ::= t067
M15 ::= t068
M16 ::= t070
M17 ::= t072 | N18M19 | N17N26
M18 ::= t074 | N18M20
M19 ::= t075
M20 ::= t076
M21 ::= t079
M22 ::= t081
M24 ::= t084 | N01N34
M25 ::= t086 | N02N35
M26 ::= t089 | M25N37
M27 ::= t094
M28 ::= t095
M29 ::= t096
M30 ::= t098
M31 ::= t099 | N18M32N18
M32 ::= t100
M33 ::= t101
M34 ::= t102
M35 ::= t103
M36 ::= t104
M37 ::= t107 | t108 | N18M41
M38 ::= t110 | t111
M39 ::= t113 | t114
M40 ::= t115 | t116 | t117 | t118 | N18M44 | N18M45
M41 ::= t119 | t120
M42 ::= t122 | t123
M43 ::= t125 | t126
M44 ::= t127 | t128
M45 ::= t129 | t130
M46 ::= t131
M47 ::= t132
M48 ::= t133
M49 ::= t064
}
t000="NOP"
t001="XOR Reg,-1"
t002="NOT Reg"
t003="XOR Mem,-1"
t004="NOT Mem"
t005="MOV Reg,Reg"
t007="ADD Mem,0"
t008="OR Reg,0"
t009="OR Mem,0"
t010="AND Reg,-1"
t011="AND Mem,-1"
t012="SUB Reg,Imm"
t013="ADD Reg,-Imm"
t014="SUB Mem,Imm
t015="ADD Mem,-Imm
t016="XOR Reg,0"
t018="AND Reg,0"
t019="XOR Reg,Reg"
t020="SUB Reg,Reg"
t021="XOR Mem,0"
t022="MOV Mem,0"
t023="AND Mem,0"
t024="CMP Reg,0"
t025="OR Reg,Reg"
t026="AND Reg,Reg"
t027="TEST Reg,Reg"
t028="MOV Mem,Mem"
t029="MOV Reg,Imm"
t030="LEA Reg,[Imm]"
t031="ADD Reg,Imm"
t032="LEA Reg,[Reg+Imm]"
t033="LEA Reg2,[Reg]"
t035="LEA Reg,[Reg+Reg2]"
t036="ADD Reg,Reg2"
t037="LEA Reg,[Reg2+Reg2+xxx]"
t038="LEA Reg,[2*Reg2+xxx]"
t039="PUSH Imm"
t040="POP Reg"
t041="MOV Mem,Imm"
t042="POP Mem"
t043="MOV Reg2,Reg"
t044="POP Reg2"
t045="PUSH Reg"
t046="MOV Mem,Reg"
t047="MOV Reg,Mem"
t048="PUSH Mem"
t049="MOV Mem2,Mem"
t050="PUSH Mem2"
t051="POP Mem2"
t052="MOV Reg2,Mem
t053="OP Reg,Imm"
t054="OP Reg,Mem"
t055="LEA Reg,[Reg2+Imm]"
t056="LEA Reg,[Reg2+Reg3]"
t057="ADD Reg,Reg3"
t058="LEA Reg,[Reg+Reg2+Imm]"
t059="OP Reg,(Imm OP Imm2)"
t060="OP Reg,Imm2"
t061="OP Mem,Imm"
t062="OP Mem,Imm2"
t063="LEA Reg,[Reg2+Reg3+Imm]"
t064="LEA Reg,[(RegX+)Reg2+Imm]"
t065="LEA Reg,[(RegX+)2*Reg2+Imm]"
t066="MOV Mem3,Mem"
t067="MOV Mem3,Mem2"
t068="OP Reg,Mem2"
t069="MOV Mem,xxx"
t070="MOV Mem2,xxx"
t071="CALL Reg"
t072="CALL Mem"
t073="JMP Reg"
t074="JMP Mem"
t075="CALL Mem2"
t076="JMP Mem2"
t077="MOV Mem2,Reg"
t078="MOV Reg,yyy"
t079="OP Reg,xxx"
t080="JMP @xxx"
t081="Jcc @xxx"
t083="ADD Reg,1"
t084="NEG Reg"
t085="ADD Mem,1"
t086="NEG Mem"
t087="ADD Reg,-1"
t088="ADD Mem,-1"
t089="NOT Mem"
t090="CMP X,Y " - without Jcc
t091="TEST X,Y" - without Jcc
t092="RET"
t093="MOVZX Reg,byte ptr [Mem]"
t094="MOV Reg8,[Mem]"
t095="MOV Reg,[Mem]"
t096="AND Reg,0FFh"
t097="OP Reg,Reg2"
t098="OP Mem,Reg2"
t099="OP Mem,Reg"
t100="OP Mem2,Reg"
t101="OP Mem2,Imm"
t102="CMP Reg,Reg"
t103="JO/JB/JNZ/JA/JS/JNP/JL/JG @xxx" - without Jcc
t104="JNO/JAE/JZ/JBE/JNS/JP/JGE/JLE @xxx" - without Jcc
t105="TEST Reg,Imm" + Jcc
t106="CMP Reg,Imm" + Jcc
t107="TEST Reg,Mem" + Jcc
t108="CMP Reg,Mem" + Jcc
t109="CMP Reg,Reg2" + Jcc
t110="CMP Mem,Reg2" + Jcc
t111="SUB Mem,Reg2" + Jcc
t112="TEST Reg,Reg2" + Jcc
t113="TEST Mem,Reg2" + Jcc
t114="AND Mem,Reg2" + Jcc
t115="CMP Mem,Imm" + Jcc
t116="SUB Mem,Imm" + Jcc
t117="TEST Mem,Imm" + Jcc
t118="AND Mem,Imm" + Jcc
t119="TEST Reg,Mem2" + Jcc
t120="CMP Reg,Mem2" + Jcc
t121="TEST Mem,Reg" + Jcc
t122="TEST Mem2,Reg" + Jcc
t123="AND Mem2,Reg" + Jcc
t124="CMP Mem,Reg" + Jcc
t125="CMP Mem2,Reg" + Jcc
t126="SUB Mem2,Reg" + Jcc
t127="TEST Mem2,Imm" + Jcc
t128="AND Mem2,Imm" + Jcc
t129="CMP Mem2,Imm" + Jcc
t130="SUB Mem2,Imm" + Jcc
t131="PUSH EAX"
t132="PUSH ECX"
t133="PUSH EDX"
t134="APICALL_BEGIN"
t135="APICALL_END"
6) Conclusion
Efficient Code Mutation has always been a goal for virus writers. The
analysis of these techniques on a mathematical background is an
important venture to increase the complexity of mutation techniques.
The word problem and its connection to the antiviral detection problem
gives just the right basis for our work.
Creating a metamorphic mutation engine needs very much time, and
without using a type-2, type-1 or even type-0 grammar, the detection is
still "trivial". Creating a concept for a valueable formal grammar of
mutation engine does not need much time, but increases the complexity
alot - the complexity may become cubical or even exponential.
My final advice: Invest some hours for planning a efficient grammar,
and as result your engine will be harder to detect by analytic scanning.
- - - - - - - - - - - - - -
Second Part To Hell
www.spth.de.vu
written in October 2008
- - - - - - - - - - - - - -
7) References
[1] Qozah; "Polymorphism and Grammars"; 29A E-Zine issue 4; 1999.
[2] Filiol, Eric; "Metamorphism, Formal grammars and Undecidable Code
Mutation"; International Journal of Computer Science, vol. 2,
number 1, 2007, pp. 70-75; 2007.
[3] Chomsky, Noam; "Three models for the description of language";
IRE Transactions on Information Theory (2): 113-124; 1956.
[4] The Mental Driller; "Metamorphism in practice"; 29A E-Zine issue 6;
2002.