**************************************************************** **************************************************************** ************ *********** ************ Chomsky Hierarchy and the Word Problem *********** ************ in Code Mutation *********** ************ by Second Part To Hell *********** ************ *********** **************************************************************** **************************************************************** 1) Introduction 2) Definitions 2.1) Formal Grammar and connection to code mutation 2.2) Word Problem and connection to virus detection 3) Chomsky Hierarchy 4) Chomsky Hierarchy in code mutation 4.1) Type-3 Grammar in code mutation and MetaPHOR 4.2) Type-2 Grammar as recursive mutation 4.3) Type-1 Grammar as macro function mutation 4.4) Type-0 Grammar as information annihilation 5) Formal Grammar of MetaPHOR 6) Conclusion 7) References 1) Inroduction Code Mutation (whether it is obfuscation of a decryption engine or full metamorphism) is a technique to prevent the detection of the computer virus using analytic scan algorithm. Formal grammar gives us the basis for a theoretical analysis of code mutation; therefore we must not neglect or ignore it. In 1999, Qozah has written an article called "POLYMORPHISM AND GRAMMARS" [1], analysing polymorphism (exaclty: a garbage generator) for the first time with this theoretical tool. Even his results for non-regular grammars are not correct, his article opened the door for a new way analysing virus technologies. In 2007, Eric Filiol wrote an article called "Metamorphism, Formal grammars and Undecidable Code Mutation" [2], dealing with the word problem and a description of a extended MetaPHOR mutation engine using a "Tzeitzin semi-Thue system" (a Type-0 Grammar). In this article you will first read shourtly about the formal grammar and word problem including the reason why this is important for virus authors. (Chapter 2) Then you will learn about the hierarchy of classes of formal grammars, the Chomsky Hierarchy and about its connection to code mutation including an analyse of the formal grammar of The Mental Driller's MetaPHOR mutation engine. Especially for type-1 grammars, you will see a surprising result. (Chapter 3-4) Later you will see the full formal grammar of MetaPHOR;a formalisation of all possible mutations rules. (Chapter 5) Before reading this article, you should read and understand the idea of formal language and grammar - for example read Qozah's article. 2) Definitions 2.1) Formal Grammar and connection to code mutation Definition 1: # A formal Grammar G is a 4-tuple G = (N, T, S, R) where: # + N is a set of non-terminal symbols # + T is an alphabeth of terminal symbols # + S is an element of N; is the start symbol # + R is the rewriting system Definition 2: # A word is a finite sequence of characters of an alphabeth T. Definition 3.1: # A universal language UL over an alphabeth T is a subset T* of all # words over that alphabeth. Definition 3.2: # A formal language L is a subset of UL, contains all words generated by # a formal grammar G := (N, T, S, R) * Formal language in code mutation: Contains all functions of the computer virus. * Word in code mutation: A word is a finite sequence of characters of the x86 instruction set (any possible function generatable with the x86 instructions). If the word is an element of L: It has a specific genotype (behaviour), but no specific phenotype (kernel code); it is a specific micro-function. * Non-terminal Symbol in code mutation: Function, which return (recursively) a specific terminal symbol depending on its behaviour information. Can not be replaced by rewriting system alone. ("a::= ..." does not exist; but "ab::=..." does) * Alphabeth of terminal symbols x86 instructions set * Rewriting system R in code mutation: The rules of mutation; how the mutation engine can change specific code or the decryption engine. Standard Example: G = (N, T, S, R) N = {A, B} T = {a, b} R = { S ::= bS | aA A ::= bS | aB | a B ::= bB | aB | a | b } Derivation: S => bS => bbS => bbaA => bbaaB => bbaabB ==> bbaaba Polymorphical Example: G = (N, T, S, R) N = {A, B} T = {a, b, c, x, y} a = "nop" b = "sub edx, 0" c = "push ebx"+"pop ebx" x = "XOR [EDI], AL" y = "INC AL" e = "" R = { S::=aS | bS | cS | xA A::=aA | bA | cA | yB B::=aB | bB | cB | e } Derivation: S => aS => acS => acxA => acxcA => acxcyB => acxcybB => acxcybe nop push ebx pop ebx XOR [EDI], AL push ebx pop ebx INC AL sub edx, 0 Metamorphical Example: G = (N, T, S, R) N = {A, B, C} T = {a, b, c, d, e, f} a = "mov reg, 0" b = "pop reg" c = "push 0" d = "mov reg2, 0"+"push reg2" e = "xor reg, reg" f = "sub reg, reg" R = { S ::= A | Bb | C | D A ::= a B ::= c | d C ::= e D ::= f } Derivation: S => Bb => db mov reg2, 0 push reg2 pop reg The polymorphical example have been already used in [1]. The metamorphic example is very trivial, but even very complex viruses as MetaPHOR use grammars like this (See Chapter 5). 2.2) Word Problem and connection to virus detection The word problem can be formulate in mathematics with group theory, but we will use the formulation of computability theory. The problem (a decidion problem) is to construct an algorithm to decide for a given word if it belongs to a formal language or not. In connection with computer virus detection that means: An anti virus program scans a file, and has to decide whether a file is infected or not. Or: It has to decide wether a specific code of the file is part of a viruscode. OR: It has to decide wether a specific word from universal language UL belongs to a formal language L generated by a formal grammar G := (N, T, S, R), which represents a polymorphic or metamorphic engine. Example: G = ({A, B}, {a, b, c}, S, T) T = { S ::= aA | bA | c A ::= aA | B B ::= bB | c } Is "aaabbbc" part of L? Yes, because: S => aA => aaA => aaaA => aaaB => aaabB => aaabbB => aaabbbB => aaabbbc Is "aabbca" part of L? No, because: L={ (a|b)(a)*(b)*c, c }, and the given word has no "c" at the end. 3) Chomsky Hierarchy The Chomsky Hierarchy is a containment hierarchy of classes of formal grammars.[3] All possible formal grammars are divided in Type-0 ... Type-3 grammars, where Type-3 is subset of Type-2 (is subset of Type-1 (is subset of Type-0)). Type 0: Recursively enumerable Restriction: no restrictions Example: S ::= aBBA AB ::= BA | Sc | C cBc ::= aaa | ab | c A ::= CBC B ::= cBca | cb | aB C ::= a | c Derivation: S => aBBA => acBcacbCBC => aabacbcBc => aabacbccbc Type 1: Context-sensitive Restriction: For all (w1 => w2): |w1|<=|w2| Example: S ::= aBBA AB ::= BA | Sc | C cBc ::= aaa A ::= CBC B ::= cBca | cb | aB C ::= a | c Derivation: S => aBBA => acBcacbCBC => aaaaacbccbc Type 2: Context-free Restriction: For all (w1 => w2): w1 is an element of non-terminal symbols Example: S ::= aBBA A ::= CBC B ::= cBca | cb | aB C ::= a | c Derivation: S => aBBA => acbcbCBC => acbcbacba Type 3: Regular Restriction: R = { A = aB, A = Ba, A = a } Example: S ::= aS | aA A ::= Bc B ::= Bc | c Derivation: S => aS => aaA => aaBc => aaBcc => aaccc 4) Chomsky Hierarchy in code mutation For an anti virus programm it is important that scanning a file does not monopolize CPU-speed for long time. We can describe the time usage of an algorithm with Landau notation O(n), where n is the amount of data or number of mutation. A simple sort algorithm has a O(n) performance: For a list of 20 entries it needs t=c*20; for a list of 2.000 entries it needs t=c*2.000. A usual algorithm for multiplication of a N x N matrix with a N Vector needs N(N-1)=N^2-N operations. The performance is O(n^2) A usual algorithm for multiplication of a N x N matrix with a N x N matrix needs N(N(N-1) operations. The performance is O(n^3). About the complexity of an algorithm, which solves the word problem for Chomsky Hierarchy: Type-3: The complexity of an algorithm is linear: O(n) Type-2: The complexity of an algorithm is at least cubical: O(n^3). Type-1: The complexity of an algorithm is exponential: O(k^n) Type-0: For Type-0 grammars the word problem is undecidable. [ Caution: Here we use an asymtotically description of the given problem: lim n -> infinity It's obvious that in practise this is not possible. Nevertheless a virus can reach very high n, i. e. with deep (infinity) recursive mutation. Virus Scanners have to detect all variants of a virus, so that asymtotically description can be used. ] 4.1) Type-3 Grammar in code mutation and MetaPHOR This is the lowest type of possible code mutation, but unfortunatly used in most polymorphic or metamorphic viruses. In Chapter 5 you can see that most parts of MetaPHOR's grammar is Type-3. For example: N02 ::= t003 | t004 N32 ::= t078 | M21N32 => N32 ::= t078 | t079N32 M43 ::= t125 | t126 A big disadvanage is, that the shrinker of MetaPHOR is the inverse of expander (obfuscator). "The expander is the part that undoes what the shrinker did. It performs *exactly* the reverse operation." [4]. This result presents the determinism in the engine. Code mutation engines have to be created in a way, that linear time complexity will be prevented. 4.2) Type-2 Grammar as recursive mutation When this type of grammar is used in a metamorphic virus, the operations for deciding the word problem increases at least cubically. As (rarly) done in MetaPHOR, creating a Type-2 grammar requires good planning. The rules have to be recursively defined. As example, the rewriting rule N24 has a recursion level > 5, maybe some rules have even higher recursive level. Because of that, Mental Driller has put a recursivity control for preventing the code to grow too much. Such infitive recursive rules are our goal, for increasing the complexity of the word problem. 4.3) Type-1 Grammar as macro function mutation Using this formal grammar in a virus, you can mutate macro functions, not just the micro-functions aka. x86 instructions. For example: ABCD ::= EFGHI A ::= "PUSH [hWnd]" | ... B ::= "PUSH lpString" | ... C ::= "PUSH 0xFF" | ... D ::= "CALL GetWindowText | ... E ::= "PUSH [hWnd]" | ... F ::= "PUSH WM_GETTEXT" | ... G ::= "PUSH 0xFF" | ... H ::= "PUSH lpString" | ... I ::= "CALL SendMessage" | ... Beside of the extremly increased complexity of the virus, the word problem will become not decideable in available time. Combining this macro-function mutation with code integration to the host file, the virus can use parts of the host code and parts or its own code for mutation. 4.4) Type-0 Grammar as information annihilation When you use shrinking-rules, a Type-0 Grammar may be created. Using the inverse rules of the obfuscator, these rules have no effect. For a valueable usage of deletion rules, the existence of big words is a condition, otherwise the virus code may loose information. 5) Formal Grammar of MetaPHOR G = (N, T, S, R) N = { N01, ..., N47, M01, ..., M49 } T = { code of MetaPHOR } R = { S ::= [ T, such that the hardcoded X86 instructions t000-t135 are replaced by the behaviour functions N01..M49 ] N01 ::= t001 | t002 | M24N36 N02 ::= t003 | t004 N03 ::= t000 | t005 | N10 | t007 | t008 | t009 | t010 | t011 t028 | M03M06 | M06M03 | t090 | t091 | M34M35 N04 ::= t012 | t013 N05 ::= t014 | t015 N06 ::= t016 | N09 | t018 | t019 | t020 N07 ::= t021 | t022 | t023 N08 ::= t024 | t025 | t026 | t027 N09 ::= t029 | t030 | M01M02 N10 ::= t031 | t032 N11 ::= t033 | t034 N12 ::= t035 | t036 N13 ::= t037 | t038 N14 ::= t041 | M01M03 N15 ::= t043 | t033 | M05M04 | N16M09 N16 ::= t046 | M05M03 N17 ::= t047 | M06M02 N18 ::= t049 | M07M03 N19 ::= t053 | N14N20 | N16M12N17 | N14M31N17 N20 ::= t054 | N18M15 N21 ::= t055 | N09N12 | N15N10 N22 ::= t056 | N15M10 N23 ::= t058 | N10N12 | N12N10 N24 ::= t059 | N19M11 | M12M13 N25 ::= t063 | N21M10 N26 ::= t065 | M14N12 N27 ::= t066 | N18M49 N28 ::= t069 | N18M16 N29 ::= t071 | N16M17 N30 ::= t073 | N16M18 | M05N38 N31 ::= t077 | N16N18 N32 ::= t078 | M21N32 N33 ::= t080 | M22M24 | M34M36 N34 ::= t083 | N01M24 | N10 N35 ::= t085 | N02M25 N36 ::= t087 | M24N01 | N10 N37 ::= t088 | M25M26 N38 ::= t092 | M03M18 N39 ::= t093 | N06M27 | M28M29 N40 ::= t097 | N16M30N17 N41 ::= t105 | t106 | N14M37 | N16M40 N42 ::= t109 | N16M38 N43 ::= t112 | N16M39 N44 ::= t121 | N18M42 N45 ::= t124 | N18M46 N46 ::= t134 | M46M47M48 N47 ::= t135 | M48M47M46 M01 ::= t039 | N14M06 M02 ::= t040 | M03N17 M03 ::= t042 | M08N18 M04 ::= t044 M05 ::= t045 | N16M06 M06 ::= t048 M07 ::= t050 M08 ::= t051 M09 ::= t052 M10 ::= t057 M11 ::= t060 M12 ::= t061 | N18M33N18 M13 ::= t062 M14 ::= t067 M15 ::= t068 M16 ::= t070 M17 ::= t072 | N18M19 | N17N26 M18 ::= t074 | N18M20 M19 ::= t075 M20 ::= t076 M21 ::= t079 M22 ::= t081 M24 ::= t084 | N01N34 M25 ::= t086 | N02N35 M26 ::= t089 | M25N37 M27 ::= t094 M28 ::= t095 M29 ::= t096 M30 ::= t098 M31 ::= t099 | N18M32N18 M32 ::= t100 M33 ::= t101 M34 ::= t102 M35 ::= t103 M36 ::= t104 M37 ::= t107 | t108 | N18M41 M38 ::= t110 | t111 M39 ::= t113 | t114 M40 ::= t115 | t116 | t117 | t118 | N18M44 | N18M45 M41 ::= t119 | t120 M42 ::= t122 | t123 M43 ::= t125 | t126 M44 ::= t127 | t128 M45 ::= t129 | t130 M46 ::= t131 M47 ::= t132 M48 ::= t133 M49 ::= t064 } t000="NOP" t001="XOR Reg,-1" t002="NOT Reg" t003="XOR Mem,-1" t004="NOT Mem" t005="MOV Reg,Reg" t007="ADD Mem,0" t008="OR Reg,0" t009="OR Mem,0" t010="AND Reg,-1" t011="AND Mem,-1" t012="SUB Reg,Imm" t013="ADD Reg,-Imm" t014="SUB Mem,Imm t015="ADD Mem,-Imm t016="XOR Reg,0" t018="AND Reg,0" t019="XOR Reg,Reg" t020="SUB Reg,Reg" t021="XOR Mem,0" t022="MOV Mem,0" t023="AND Mem,0" t024="CMP Reg,0" t025="OR Reg,Reg" t026="AND Reg,Reg" t027="TEST Reg,Reg" t028="MOV Mem,Mem" t029="MOV Reg,Imm" t030="LEA Reg,[Imm]" t031="ADD Reg,Imm" t032="LEA Reg,[Reg+Imm]" t033="LEA Reg2,[Reg]" t035="LEA Reg,[Reg+Reg2]" t036="ADD Reg,Reg2" t037="LEA Reg,[Reg2+Reg2+xxx]" t038="LEA Reg,[2*Reg2+xxx]" t039="PUSH Imm" t040="POP Reg" t041="MOV Mem,Imm" t042="POP Mem" t043="MOV Reg2,Reg" t044="POP Reg2" t045="PUSH Reg" t046="MOV Mem,Reg" t047="MOV Reg,Mem" t048="PUSH Mem" t049="MOV Mem2,Mem" t050="PUSH Mem2" t051="POP Mem2" t052="MOV Reg2,Mem t053="OP Reg,Imm" t054="OP Reg,Mem" t055="LEA Reg,[Reg2+Imm]" t056="LEA Reg,[Reg2+Reg3]" t057="ADD Reg,Reg3" t058="LEA Reg,[Reg+Reg2+Imm]" t059="OP Reg,(Imm OP Imm2)" t060="OP Reg,Imm2" t061="OP Mem,Imm" t062="OP Mem,Imm2" t063="LEA Reg,[Reg2+Reg3+Imm]" t064="LEA Reg,[(RegX+)Reg2+Imm]" t065="LEA Reg,[(RegX+)2*Reg2+Imm]" t066="MOV Mem3,Mem" t067="MOV Mem3,Mem2" t068="OP Reg,Mem2" t069="MOV Mem,xxx" t070="MOV Mem2,xxx" t071="CALL Reg" t072="CALL Mem" t073="JMP Reg" t074="JMP Mem" t075="CALL Mem2" t076="JMP Mem2" t077="MOV Mem2,Reg" t078="MOV Reg,yyy" t079="OP Reg,xxx" t080="JMP @xxx" t081="Jcc @xxx" t083="ADD Reg,1" t084="NEG Reg" t085="ADD Mem,1" t086="NEG Mem" t087="ADD Reg,-1" t088="ADD Mem,-1" t089="NOT Mem" t090="CMP X,Y " - without Jcc t091="TEST X,Y" - without Jcc t092="RET" t093="MOVZX Reg,byte ptr [Mem]" t094="MOV Reg8,[Mem]" t095="MOV Reg,[Mem]" t096="AND Reg,0FFh" t097="OP Reg,Reg2" t098="OP Mem,Reg2" t099="OP Mem,Reg" t100="OP Mem2,Reg" t101="OP Mem2,Imm" t102="CMP Reg,Reg" t103="JO/JB/JNZ/JA/JS/JNP/JL/JG @xxx" - without Jcc t104="JNO/JAE/JZ/JBE/JNS/JP/JGE/JLE @xxx" - without Jcc t105="TEST Reg,Imm" + Jcc t106="CMP Reg,Imm" + Jcc t107="TEST Reg,Mem" + Jcc t108="CMP Reg,Mem" + Jcc t109="CMP Reg,Reg2" + Jcc t110="CMP Mem,Reg2" + Jcc t111="SUB Mem,Reg2" + Jcc t112="TEST Reg,Reg2" + Jcc t113="TEST Mem,Reg2" + Jcc t114="AND Mem,Reg2" + Jcc t115="CMP Mem,Imm" + Jcc t116="SUB Mem,Imm" + Jcc t117="TEST Mem,Imm" + Jcc t118="AND Mem,Imm" + Jcc t119="TEST Reg,Mem2" + Jcc t120="CMP Reg,Mem2" + Jcc t121="TEST Mem,Reg" + Jcc t122="TEST Mem2,Reg" + Jcc t123="AND Mem2,Reg" + Jcc t124="CMP Mem,Reg" + Jcc t125="CMP Mem2,Reg" + Jcc t126="SUB Mem2,Reg" + Jcc t127="TEST Mem2,Imm" + Jcc t128="AND Mem2,Imm" + Jcc t129="CMP Mem2,Imm" + Jcc t130="SUB Mem2,Imm" + Jcc t131="PUSH EAX" t132="PUSH ECX" t133="PUSH EDX" t134="APICALL_BEGIN" t135="APICALL_END" 6) Conclusion Efficient Code Mutation has always been a goal for virus writers. The analysis of these techniques on a mathematical background is an important venture to increase the complexity of mutation techniques. The word problem and its connection to the antiviral detection problem gives just the right basis for our work. Creating a metamorphic mutation engine needs very much time, and without using a type-2, type-1 or even type-0 grammar, the detection is still "trivial". Creating a concept for a valueable formal grammar of mutation engine does not need much time, but increases the complexity alot - the complexity may become cubical or even exponential. My final advice: Invest some hours for planning a efficient grammar, and as result your engine will be harder to detect by analytic scanning. - - - - - - - - - - - - - - Second Part To Hell www.spth.de.vu written in October 2008 - - - - - - - - - - - - - - 7) References [1] Qozah; "Polymorphism and Grammars"; 29A E-Zine issue 4; 1999. [2] Filiol, Eric; "Metamorphism, Formal grammars and Undecidable Code Mutation"; International Journal of Computer Science, vol. 2, number 1, 2007, pp. 70-75; 2007. [3] Chomsky, Noam; "Three models for the description of language"; IRE Transactions on Information Theory (2): 113-124; 1956. [4] The Mental Driller; "Metamorphism in practice"; 29A E-Zine issue 6; 2002.