Solution Pro!.

profileJahleelN
a_methodology_for_testing_cpu_emulators.pdf

29

A Methodology for Testing CPU Emulators

LORENZO MARTIGNONI, Università degli Studi di Udine ROBERTO PALEARI, ALESSANDRO REINA, GIAMPAOLO FRESI ROGLIA, and DANILO BRUSCHI, Università degli Studi di Milano

A CPU emulator is a software system that simulates a hardware CPU. Emulators are widely used by computer scientists for various kind of activities (e.g., debugging, profiling, and malware analysis). Although no theoretical limitation prevents developing an emulator that faithfully emulates a physical CPU, writing a fully featured emulator is a very challenging and error prone task. Modern CISC architectures have a very rich instruction set, some instructions lack proper specifications, and others may have undefined effects in corner cases. This article presents a testing methodology specific for CPU emulators, based on fuzzing. The emulator is “stressed” with specially crafted test cases, to verify whether the CPU is properly emulated or not. Improper behaviors of the emulator are detected by running the same test case concurrently on the emulated and on the physical CPUs and by comparing the state of the two after the execution. Differences in the final state testify defects in the code of the emulator. We implemented this methodology in a prototype (named as EmuFuzzer), analyzed five state-of-the-art IA-32 emulators (QEMU, Valgrind, Pin, BOCHS, and JPC), and found several defects in each of them, some of which can prevent proper execution of programs.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging; D.2.8 [Software Engineering]: Metrics—Complexity measures, performance measures

General Terms: Reliability, Verification

Additional Key Words and Phrases: Software testing, fuzzing, emulation, automatic test generation

ACM Reference Format: Martignoni, L., Paleari, R., Reina, A., Roglia, G. F., and Bruschi, D. 2013. A methodology for testing CPU emulators. 2013 ACM Trans. Softw. Eng. Methodol. 22, 4, Article 29 (October 2013), 26 pages. DOI: http://dx.doi.org/10.1145/2522920.2522922

1. INTRODUCTION

In Computer Science, the term “emulator” is typically used to denote a piece of soft- ware that simulates a hardware system [Lichstein 1969]. Different hardware sys- tems can be simulated: a device [Google Inc. 2011], a CPU (Pin [Luk et al. 2005] and Valgrind [Nethercote 2004]), and even an entire PC system (QEMU [Bellard 2005], BOCHS [Lawton 1996], JPC [Preston et al. 2007], and Simics [Magnusson et al. 2002]). Emulators are widely used today for many applications: development, debugging, pro- filing, security analysis, etc. For example, the NetBSD AMD64 port was initially devel- oped using an emulator [netbsd64 2011].

This article is a revised and extended version of the conference paper “Testing CPU Emulators”, presented at ISSTA 2009. Authors’ addresses: L. Martignoni, Dipartimento di Fisica, Università degli Studi di Udine, Italy; email: [email protected]; R. Paleari, A. Reina, G. F. Roglia, and D. Bruschi, Dipartmento de Informatica, Università degli Studi di Milano, Italy; email: {roberto, gianz, bruschi}@security.dico.unimi.it; [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. c© 2013 ACM 1049-331X/2013/10-ART29 $15.00

DOI: http://dx.doi.org/10.1145/2522920.2522922

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:2 L. Martignoni et al.

The Church-Turing thesis implies that any effective computational method can be emulated within any other. Consequently, any hardware system can be emulated via a program written with a standard programming language. Despite the absence of any theoretical limitation that prevents the development of a correct and complete emulator, from the practical point of view, the development of such a software is very challenging. This is particularly true for CPU emulators, that simulate a physical CPU. Indeed, the instruction set of a modern CISC CPU is very rich and complex. Moreover, the official documentation of CPUs often lacks the description of the semantics of certain instructions in certain corner cases and sometimes contains inaccuracies (or ambiguities). Although several good tools and debugging techniques exist [Myers 1978], developers of CPU emulators have no specific technique that can help them to verify whether their software emulates the CPU by following precisely the specification of the vendors. As CPU emulators are employed for a large variety of applications, defects in their code might have cascading implications. Imagine, for example, what consequences the existence of any defect in the emulator used for porting NetBSD to AMD64 would have had on the reliability of the final product.

Assuming that the physical CPU is correct by definition, the ideal CPU emulator has to mimic exactly the behavior of the physical CPU it is emulating. On the contrary, an approximate emulator deviates, in certain situations, from the behavior of the physical CPU. There are particular examples of approximate emulators in literature [Ferrie 2006; Ormandy 2007; Quist and Smith 2006; Rutkowska 2004; Raffetseder et al. 2007]. Our goal is to develop a general automatic technique to discover deviations between the behavior of an emulator and of the corresponding physical CPU. In particular, we are interested in investigating deviations (i.e., state of the CPU registers and contents of the memory) which could modify the behavior of a program in an emulated environment. On the other hand, we are not interested in deviations that lead only to internal differences in the state (e.g., differences in the state of CPU caches), because these differences do not affect the behavior of the programs running inside the emulated environment.

In this article, we present a fully automated and black-box testing methodology for CPU emulators, based on fuzzing [Miller et al. 1990]. Roughly speaking such a methodology works as follows. Initially, we automatically generate a very large number of test cases. Strictly speaking, a test case is a single CPU instruction together with an initial environment configuration (CPU registers and memory contents); a more formal definition of a test case is given in Section 3.3. These test cases are subsequently executed both on the physical CPU and on the emulated CPU. Any difference detected in the configurations of the two environments (e.g., register values or memory contents) at the end of the execution of a test case, is considered a witness of an incorrect behavior of the emulator. Given the unmanageable size of the test case space, we adopt two strategies for generating test cases: purely random test case generation and hybrid algorithmic/random test case generation. The latter guarantees that each instruction in the instruction set is tested at least in some selected execution contexts.

We have implemented this testing methodology in a prototype for IA-32, named as EmuFuzzer, and used it to test five state-of-the-art emulators: BOCHS [Lawton 1996], QEMU [Bellard 2005], Pin [Luk et al. 2005], Valgrind [Nethercote 2004], and JPC [Preston et al. 2007]. Although Pin and Valgrind are dynamic instrumentation tools, their internal architecture resembles, in all details, the architecture of traditional emulators and therefore they can suffer from the same problems. We found several deviations in the behaviors of each of the five emulators. Some examples of the deviations we found in these state-of-the-art emulators are reported in Table I1. As an example, let us consider the instruction add $0x1,(%eax), which adds the immediate

1In this article, we use IA-32 assembly and we adopt the AT&T syntax.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:3

Ta b le

I. E

xa m

p le

s o f

In st

ru ct

io n s

th a t

B e h a ve

D iff

e re

n tly

w h e n

E xe

cu te

d in

th e

P h ys

ic a lC

P U

a n d

W h e n

E xe

cu te

d in

a n

E m

u la

te d

C P

U (t

h a t

e m

u la

te s

a n

IA -3

2 C

P U

)

In st

ru ct

io n

IA -3

2 Q

E M

U V

a lg

ri n

d P

in B

O C

H S

J P

C l o c k

f c o s

il le

ga l

in st

r. l o c k

ig n

or ed

n o

d if

f. n

o d

if f.

n o

d if

f. l o c k

ig n

or ed

i n t 1

tr a p

n o

d if

f. il

le ga

l in

st r.

n o

d if

f. ge

n er

a l

p ro

t. fa

u lt

n ot

su p

p or

te d

f l d 1

f p u i p

= e i p

f p u i p

= 0

f p u i p

= 0

F P

U vi

rt u

a li

ze d

∗ n

o d

if f.

f p u i p

= 0

a d d

$ 0 x 1 , ( % e a x )

( % e a x )

= 0 x d 0

( % e a x )

= 0 x c f

n o

d if

f. n

o d

if f.

n o

d if

f. n

o d

if f.

p o p

% f s

% e s p

= 0 x b f d b b 1 0 8

n o

d if

f. n

o d

if f.

% e s p

= 0 x b f d b b 1 0 6

n o

d if

f. se

gm en

t n

ot p

re se

n t

p o p

0 x f f f f f f f f

% e s p

= 0 x b f f f f e 4 4

n o

d if

f. n

o d

if f.

n o

d if

f. % e s p

= 0 x b f f f f e 4 8

n o

d if

f.

F or

ea ch

in st

ru ct

io n

, w

e re

p or

t th

e b

eh av

io r

of th

e p

h ys

ic a

l C

P U

a n

d th

e b

eh av

io r

of th

e em

u la

to rs

(d if

fe re

n ce

s a

re h

ig h

li gh

te d

). ∗ P

IN vi

rt u

a li

ze s

th e

p h

ys ic

a l

F P

U , so

fl oa

ti n

g p

oi n

t in

st ru

ct io

n s

a re

ex ec

u te

d n

a ti

ve ly

ra th

er th

a n

b ei

n g

em u

la te

d .

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:4 L. Martignoni et al.

Fig. 1. Intel x86 instruction format.

0x1 to the byte pointed by the register eax. Assuming that the original value of the byte is 0xcf, the execution of the instruction on the physical CPU, and on four of the tested emulators, provides the result 0xd0. In QEMU, instead, the value is not updated correctly for a certain encoding of the instruction. We also discovered instructions that are correctly executed in the native environment but freeze QEMU and instructions that are not supported by Valgrind and thus generate exceptions. On the other hand, we also found instructions that are executed by Pin and BOCHS but that cause exceptions on the physical CPU. The results obtained witness the difficulty of writing a fully featured and specification-compliant CPU emulator, but also prove the effectiveness and importance of our testing methodology.

To summarize, this article makes the following contributions:

—a fully automated testing methodology, based on fuzz-testing, specific for CPU emu- lators;

—an optimized algorithm for test case generation that systematically explores the instruction set, while minimizing redundancy;

—a prototype implementation of our testing methodology for IA-32 emulators; —an extensive testing of five IA-32 emulators that resulted in the discovery of several

defects in each of them, some of which represent serious bugs.

This article is organized as follows. Section 2 briefly reviews the main fundamental features of the Intel IA-32 architecture. Section 3 introduces formally the notion of faithful emulation. Section 4 describes in detail our algorithms for test-case generation and how test cases are run to detect if an emulator is not emulating faithfully the CPU. Section 5 evaluates our methodology by presenting the results of the testing of five CPU emulators. Section 6 discusses limitations and future work. Finally, Section 7 presents the related literature, and Section 8 concludes.

2. IA-32 INTEL ARCHITECTURE

The IA-32 refers to a family of 32-bit Intel processors that are widely used in many multipurpose environments because of their facilities and performance. In this section, we provide a brief introduction to the IA-32 architecture. For further details, an interested reader can refer elsewhere [Intel 2008].

IA-32 is a CISC architecture, with an incredible number of different instructions and a complex encoding scheme. Instruction length can vary from 1 to 17 bytes. The format of an Intel x86 instruction is depicted in Figure 1. An instruction is com- posed of different fields: it starts with up to 4 prefixes, followed by an opcode, an ad- dressing specifier (i.e., ModR/M and SIB fields), a displacement and an immediate data field [Intel 2008]. Opcodes are encoded with one, two, or three bytes, but three extra bits of the ModR/M field can be used to denote certain opcodes. In total, the instruction set is composed of more than 700 possible values of the opcode field. The ModR/M field

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:5

is used in many instructions to specify nonimplicit operands: the Mod and R/M subfields are used in combination to specify either registry operands or to encode addressing modes, while the Reg/Opcode subfield can either specify a register number or, as men- tioned before, additional bits of opcode information. The SIB byte is used with certain configurations of the ModR/M field, to specify base-plus-index or scale-plus-index ad- dressing forms. The SIB field is in turn partitioned in three subfields: Scale, Index, and Base, specifying respectively the scale factor, the index register, and the base register. Finally, the optional addressing displacement and immediate operands are encoded in the Displacement and Immediate fields respectively. Since the encoding of the ModR/M and SIB bytes is not trivial at all, the Intel ×86 specification provides tables describing the semantics of the 256 possible values each of these two bytes might assume. In conclusion, it is easy to see that elementary decoding operations, such as determining the length of an instruction, require decoding the entire instruction format and inter- preting the various fields correctly. In recent years, the advent of several instruction extensions (e.g., Multiple Math eXtension (MMX) and Streaming SIMD Extensions (SSE)) contributed to make the instruction set even more complicated.

The IA-32 architecture supports four basic operating modes: real-address mode, pro- tected mode, virtual-8086 mode, and system management mode. The operating mode of the processor determines which instructions and architectural features are available. Every operating mode implies a well-defined set of instructions and semantics, and some instructions behave differently depending on the mode. For example, instruc- tion can raise different exceptions and can update flags and registers differently when executed in the protected mode and when executed in the virtual-8086 mode.

Any task or program running on an IA-32 processor is given a set of resources for storing code, data, state information, and for executing instructions. These resources constitute the basic execution environment and they are used by both the operating system and users’ applications. The resources of the basic execution environment are identified as follows.

—Address Space. Any task or program can address a 32-bit linear address space. —Basic Program Execution Environment. The eight general-purpose registers (eax, ecx, edx, ebx, esp, ebp, esi, edi), the six segment registers (cs, ss, ds, es, fs, gs), the eflags register, and the eip register comprise a basic execution environment in which to execute a set of general-purpose instructions.

—Stack. This is used to support procedure or subroutine calls and the passing of parameters between procedure and subroutines.

—×87 FPU Registers. This set of registers provides an execution environment for floating point operations.

—MMX Registers and XMM Registers. These registers are used by dedicated instruc- tions designed for accelerating multimedia and communication applications.

In addition to these resources, the IA-32 architecture provides the following resources as part of its system-level architecture.

—I/O Ports. The IA-32 architecture supports a transfer of data to and from input/output ports;

—Control Register. The five control registers (cr0 through cr4) determine the operating mode of the processor and the characteristics of the currently executing task;

—Memory Management Register. The gdtr, idtr, task register, and ldtr specify the locations of data structures used in protected mode memory management;

—Debug Register. The debug registers (db0 through db7) control and allow monitoring of the processor’s debugging operations;

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:6 L. Martignoni et al.

—Memory-Type Range Registers. The memory type range registers are used to assign memory type to regions of memory such as: uncacheable, write combining, write through, write back, and write protected type;

—Machine-Specific Registers. The processor provides a variety of machine specific reg- isters (MSR) that are used to control and report on processor performance;

—Machine Check Registers. The machine check registers consist of a set of control, status, and error-reporting MSRs that are used to detect and report on hardware (machine) errors. Specifically the IA-32 processors implement a machine check ar- chitecture that provides a mechanism for detecting and reporting errors such as: system bus errors, ECC errors, parity errors, cache errors, and TLB errors.

CPU emulators have to offer an execution environment suitable for running an application or even a commodity operating system. Given the complexity of IA-32 architecture, fully featured CPU emulators for this architecture are complex pieces of software. Our claim is that this complexity is the cause of a large number of defects.

3. OVERVIEW

This section describes how CPU emulators work, formalizes our notion of faithful emulation of a physical CPU, and sketches the idea behind our testing methodology.

3.1. CPU Emulators

By CPU emulator, we mean a piece of software system that simulates the execution en- vironment offered by a physical CPU. The execution of a binary program P is emulated when each instruction of P is executed by a CPU emulator. Inside a CPU emulator instructions are typically executed using either interpretation or just-in-time transla- tion. Here, we are only interested in emulators adopting the former strategy, in such case instructions are executed by mimicking in every detail the behavior of the physical CPU, obviously operating on the resources of the emulated execution environment.

The execution environment can be properly emulated even if some internal compo- nents of the physical CPU are not considered (e.g., the instruction cache): as these components are used transparently by the physical CPU, no program can access them. Similarly, emulated execution environments can contain extra, but transparent, com- ponents not found in hardware execution environments (e.g., the cache used to store translated code).

3.2. Faithful CPU Emulation

Given a physical CPU CP , we denote with CE a software CPU emulator that emulates CP . Our ideal goal is to automatically analyze a given CE to tell whether it faithfully emulates CP . In other words, we would like to tell if CE behaves equivalently to CP , in the sense that any attempt to execute a valid (or invalid) instruction results in the same behavior in both CP and CE. In the following, we introduce some definitions which will help us to precisely define this equivalence notion.

Let N be the number of bits used by a CPU C for representing its memory addresses as well as the registers contents. A state s of C is represented by the following tuple s = ( pc, R, M, E) where — pc ∈ {0, . . . , 2N − 1} ∪ halt; —R =< r1, . . . , rk >; ri ∈ {0, . . . , 2N − 1} is the value contained in the ith CPU register; —M =< b0, . . . , b2N −1 >; bi ∈ {0, . . . , 255} is the contents of the ith memory byte; —E ∈ {⊥, illegal instruction, division by zero, general protection fault, . . .} denotes the

exception that occurred during the execution of the last instruction; the special exception state ⊥ indicates that no exception occurred.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:7

We denote by S the set of all states of a CPU. The behavior of a CPU C is modeled by a transition system (S, δC), where δC : S → S is the state-transition function which maps a CPU state s = ( pc, R, M, E) into a new state s′ = ( pc′, R′, M′, E′) by executing the instruction whose address is specified by the pc. The transition function δ is defined as follows:

δC( pc, R, M, E) def=

⎧ ⎨ ⎩

( pc, R, M, E) if pc = halt ∨ E =⊥, ( pc, R, M, E′) if an exception occurs, ( pc′, R′, M′, ⊥) otherwise.

When E′ =⊥ the contents of the registers R′, of the memory M′ and of pc′ are updated according to the semantics of the executed instruction. On the other side, if an exception occurs, then we assume for simplicity2 that δC( pc, R, M, E) = ( pc, R, M, E′). When the last instruction of a program is executed, the program counter is set to halt, and from that point on the state of the environment is not updated anymore.

We can now formally define what it means for CE to be a faithful emulator of CP . Intuitively, CE faithfully emulates CP if the state-transition function δCE that models CE is semantically equivalent to the function δCP that models CP . That is, for each possible state s ∈ S, δCP and δCE always transition into the same state. More formally, CE faithfully emulates CP iff:

∀ s ∈ S : δCP (s) = δCE (s). 3.3. Fuzzing and Differential Testing of CPU Emulators

Given a physical CPU CP and an emulator CE, proving that CE faithfully emulates CP is unfeasible as it requires the verification of a huge number of states. Thus, our aim is to find witnesses of the fact that an emulator CE does not faithfully emulate CP .

We achieve this goal by generating a number of test cases, that is, CPU states s = ( pc, R, M, E), and looking for a test case s̄ which proves that CE unfaithfully emulates CP , that is,3

s̄ ∈ S : δCP (s̄) = δCE (s̄). Our approach for finding s̄ is based on fuzzing [Miller et al. 1990] (for test case

generation) and differential testing [McKeeman 1998] (to compare δCP (s) against δCE (s)). Once a test case s has been generated we set the state of both CP and CE to s. Then, we execute the instruction pointed by pc in both CP and CE. At the end of the execution of the instruction, we compare the final state. If no difference is found, then δCP (s) = δCE (s) holds. On the other hand, a difference in the final state proves that δCP (s) = δCE (s) and therefore that CE does not faithfully emulate CP .

Figure 2 shows an example of our testing methodology.4 We run two different test cases, namely s and s. To ease the presentation, in the figure we report only the relevant state information (three registers and the contents of few memory locations) and we represent the program counter by underlining the instruction it is pointing to. Furthermore, when the states of the two environments do not differ, we graphically overlap them. The first test case s (Figure 2(a)) consists of executing the instruction mov $0x1, %eax. We set the state of CP and CE to s and we execute in both the instruction

2Exceptions actually modify CPU registers and memory. However, in our model, when an exception occurs execution is interrupted, so these modifications can be safely ignored. 3Here we assume that δ is a function (hence, deterministic) for a specific CPU model. Indeed, even if for some instructions the CPU specifications are not completely defined, it turns out that, given an initial state, the behavior of any instruction is deterministic. Obviously, CPU undefined behaviors are not documented in the released specifications, therefore emulators do not simulate them. 4This example reflects a real defect we have found in QEMU using our testing methodology.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:8 L. Martignoni et al.

Fig. 2. An example of our testing methodology with two different test cases (s and s): (a) no deviation in the behavior is observed, (b) the words at the top of the stack differ (highlighted in gray).

pointed by the program counter. As there is no difference in the final states, we conclude that δCE (s) = δCP (s). The second test case s (Figure 2(b)) consists of executing the instruction push %fs, that saves the segment register fs on the stack. Although the register is 16 bits wide, the IA-32 specification dictates that, when operating in 32-bit mode, the CPU has to reserve 32 bits of the stack for the store. In the example, we observe that CP leaves the upper 16 bits of the stack untouched, while CE overwrites them with zero (the different bytes are highlighted in the figure). The two final states differ because the contents of their memory differs, consequently, δCP (s) = δCE (s). That proves that CE does not faithfully emulate CP .

4. EMUFUZZER

The development of the approach briefly described in the previous section requires overcoming two major difficulties. First, as the potential number of states in which an emulator should be tested is prohibitively large, we have to focus our efforts on selecting a small subset of states, which maximizes the completeness of the testing. Second, the detection of deviations in the behaviors of the two environments requires us to properly setup and inspect their state at the end of the execution of each test case. Thus, we need to develop a mechanism to efficiently initialize and compare the state of the two environments. In this section, we provide a detailed description of how these difficulties have been overcome.

Although the methodology we are proposing is architecture independent, our implementation, called EmuFuzzer, is currently specific for IA-32. This choice is solely motivated by our limited hardware availability. Nevertheless, minor changes to the

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:9

Fig. 3. Pseudocode of the program which generates a test case.

implementation would be sufficient to port it to different architectures. To ease the development, the current version of the prototype runs entirely in user-space and thus can only verify the correctness of the emulation of unprivileged instructions and whether privileged instructions are correctly prohibited. EmuFuzzer deals with two different types of emulators: process emulators that emulate a single process at a time (e.g., Valgrind, PIN, and QEMU), and whole-system emulators that emulate an entire system (e.g., BOCHS, JPC, and QEMU5).

4.1. Test Case Generation

As just mentioned, in our testing methodology, a test case s = ( pc, R, M, ⊥) is a state of the environment under test. The memory contains the code that will be executed by the CPU, as well as the corresponding data part of which is contained in R. To generate test cases we adopt two strategies: (I) random test case generation, where both data and code are random, and (II) CPU-assisted test case generation, where data is random, and code is generated algorithmically, with the support of the physical and of the emulated CPUs. The advantage of using two different strategies is a better coverage of the test case space. Test cases are generated by an assembly program, which contains instructions for environment initialization, that is, memory and registers, and loads into the test case memory one single instruction, that is, the instruction we want to test. Figure 3 shows a C pseudocode of such a program. This program initializes the state of the environment, by loading the memory content (lines 6–10) and the data in the CPU registers (lines 12–15), and subsequently it triggers the execution of the code of the test case (line 19). The program is compiled with appropriate compiler flags to generate a tiny self-contained executable (i.e., that does not use any shared library).

There are other possible approaches to generate the code of test cases. For example, one can generate assembly instructions and then compile them with an assembler or use a disassembler to detect which sequences of bytes encode a legal instruction. However, limitations of the assembler or of the disassembler negatively impact on the completeness of the generated test cases. Besides our approach, detailed in the

5QEMU supports both whole-system and process emulation.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:10 L. Martignoni et al.

following, none of the ones just mentioned can guarantee no false-negative (i.e., that a sequence of bytes encoding a valid instruction is considered invalid).

4.1.1. Random Test Case Generation. In random test case generation, both data and code of the test case are generated randomly. The memory is initialized by mapping a file filled with random data. For simplicity, the same file is mapped multiple times at consecutive addresses until the entire user-portion of the address space is allocated. To avoid a useless waste of memory, the file is lazily mapped in memory, such that physical memory pages are allocated only if they are accessed. The CPU registers are also initialized with random values. As we work in user-space, we cannot allocate the entire address space because a part of it is reserved for the kernel. Therefore, to minimize page faults when registers are used to dereference memory locations, we make sure the value of general purpose registers fall around the middle of the allocated user address space. The rationale is to maximize the probability that, for any instruction, memory operands refer to valid locations. Obviously, code generated with this random approach might contain more than one instruction.

4.1.2. CPU-Assisted Test Case Generation. A thorough testing of an emulator requires us to verify that each possible instruction is emulated faithfully. Unfortunately, the pure random test case generation approach presented earlier is very unlikely to cover the entire instruction set of the architecture (the majority of CPU instructions require operands encoded using specific encoding and others have opcodes of multiple bytes). Ideally, we would have to enumerate and test all possible instances of instructions (i.e., combinations of opcodes and operands). Clearly this is not feasible. To narrow the problem space, we identify all supported instructions and then we test the emulator using only few peculiar instances of each instruction. That is, for each opcode, we generate test cases by combining the opcodes with some predefined operand values. As in random-test case generation, the data of the test case are random.

Naı̈ve Exploration of the Instruction Set. Our algorithm for generating the code of a test case leverages both the physical and the emulated CPUs, in order to identify byte sequences representing valid instructions. We call our algorithm CPU-assisted test case generation. The algorithm enumerates the sequences of bytes and discards all the sequences that do not represent valid code. The CPU is the oracle that tells us if a sequence of bytes encodes a valid instruction or not: sequences that raise illegal instruc- tion exceptions do not represent valid code. We run our algorithm on the physical and on the emulated CPUs and then we take the union of the two sets of valid instructions found. The sequences of bytes that cannot be executed on both CPUs are discarded be- cause they do not represent interesting test cases: we know in advance that the CPUs will behave equivalently (i.e., E′ = illegal instruction). On the other hand, a sequence of bytes that can be executed on at least one of the two CPUs is considered interesting because it can lead to one of the following situations: (I) it represents a valid instruction for one CPU and an invalid instruction for the other; (II) it encodes a valid instruction for both CPUs but, once executed, causes the CPUs to transition to two different states.

Optimized Exploration of the Instruction Set. We can imagine representing all valid CPU instructions as a tree, where the root is the empty sequence of bytes and the nodes on the path from the root to the leaves represent the various bytes that compose the instruction. Figure 4(a) shows an example of such a tree. Our algorithm exploits a particular property of this tree in order to optimize the traversal and to avoid the gen- eration of redundant test cases: the majority of instructions have one or more operands and thus multiple sequences of bytes, sharing the same prefix, encode the same instruction, but with different operands. In the following, we describe an example of the optimized instruction set exploration; further details are then given in Section 4.2.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:11

Fig. 4. Example of CPU-assisted test case generation for the opcode 66 05 (mov imm16,%ax): (a) naı̈ve and (b) optimized generation (paths in gray are not explored).

As an example, let us consider the 216 sequences of bytes from 66 05 00 00 to 66 05 FF FF that represent the same instruction, add imm16,%ax, with just different values of the 16-bit immediate operand. Figure 4(a) shows the tree representation of the bytes that encode this instruction. The subtree rooted at node 05 encodes all the valid operands of the instruction. Without any insight on the format of the instruction, one has to traverse in depth-first ordering the entire subtree and to assume that each path represents a different instruction. Then, for each traversed path, a test case must be generated. Our algorithm, by traversing only few paths of the subtree rooted at node 05, is able to infer the format of the instruction: (I) the existence of the operand, (II) which bytes of the instruction encode the opcode and which ones encode the operand, and (III) the type of the operand. Once the instruction has been decoded (in the case of the example the opcode is 66 05 and it is followed by a 16-bit immediate), without having to traverse the remaining paths, our algorithm generates a minimal set of test cases with a very high coverage of all the possible behaviors of the instruction. These test cases are generated by fixing the bytes of the opcode and varying the bytes of the operand. The intent is to select operand values that more likely generate the larger class of behaviors (e.g., to cause an overflow or to cause an operation with carry). For example, for the opcode 66 05, our algorithm decodes the instruction by exploring

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:12 L. Martignoni et al.

only 0.5% of the total number of paths and generates only 56 test cases. The optimized tree traversal is shown in Figure 4(b), where paths in gray are those that do not need to be explored. The heuristics on which our rudimentary, but faithful, instructions decoder is built on is described in Section 4.2. It is worth noting that, unlike traditional disassemblers, we decode instructions without any prior knowledge of their format. Thus, we can infer which bytes of an instruction represent the opcode, but we do not know which high-level instruction (e.g., add) is associated with the opcode.

4.2. The Decoder

The optimized traversal algorithm, just described in Section 4.1.2, requires the ability to decode an instruction, and to identify its opcode and operands. Such a task is undertaken by a specific module (less than 500 lines of code) which we named the decoder. The decoder uses the CPU as an oracle: given a sequence of bytes, the CPU tells us if that sequence encodes a valid instruction or not [Paleari et al. 2010]. The decoding is trial-based: we mutate an executable sequence of bytes, we query the oracle to see which mutations are valid and which are not, and from the result of the queries we infer the format of the instruction. Mutations are generated following specific schemes that reflect the ones used by the CPU to encode operands [Intel 2008].

In the following, we briefly describe how the decoder infers the length of an in- struction and the format of nonimplicit operands, assuming to know only the encoding schemes used to encode operands.

4.2.1. Determining Instruction Length. For determining the length of a given instruction, the decoder exploits the fact that the CPU fetches, and decodes, the bytes of the in- struction incrementally. Given an arbitrary sequence of bytes B = b1 . . . bn, the first goal is to detect if the bytes represent a valid instruction. The decoder executes the input string B in a specially crafted execution environment, such that every fetch of the bytes composing the instruction can be observed.

The decoder partitions B into subsequences of incremental length (B1 = b1, B2 = b1b2, . . . , Bn = b1 · · · bn) and then executes one subsequence after another, using single- stepping. The goal is to intercept the fetch of the various bytes of the instruction, which is achieved by placing the ith subsequence Bi (with i = 1 . . . n) in memory such that it overlaps two adjacent memory pages, m and m′. The first i bytes are located at the end of m, and the remaining (n− i) bytes at the beginning of m′. The two pages have special permissions: m allows read and execute accesses, while m′ prohibits any access. When the instruction is executed, the i bytes in the first page are fetched incrementally by the CPU. If the instruction is longer than i bytes, the CPU will try to fetch the next byte, (i + 1)th , and will raise a page fault exception (where the faulty address corresponds to the base address of m′) because the page containing the byte being read, m′, is not accessible. In this case, the decoder repeats the process with the string Bi+1, that is placing the i + 1th bytes at the end of m and the remaining at m′. On the other hand, if the instruction contained in the page m has the correct length, it will be executed by the CPU without accessing the bytes in m′. In such a situation the instruction can be both valid and invalid. The instruction is valid if it is executed without causing any exception; it is also valid if the CPU raises a page fault (in this case, the faulty address does not correspond to the base address of m′) or a general protection fault exception. A page fault exception occurs if the instruction tries to read or write data from the memory; a general protection fault exception is raised if the instruction has improper operands. The instruction is invalid instead, if the CPU raises an illegal instruction exception. In both cases, the decoder returns.

Figure 5 shows our CPU-assisted decoder in action on two different sequences of bytes, one valid and one invalid. The first sequence is B = 88 b7 53 10 fa ca...,

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:13

Fig. 5. Computation of the length of instructions using our CPU-assisted instruction decoder: (a) valid and (b) invalid instructions.

corresponding to the instruction mov %dh, $0xcafa1053(%edi). The decoder allocates two adjacent memory pages and removes any permission from the second one. Then, it starts with the first subsequence B1 = 88. The byte is positioned at the end of the page and then executed through single stepping. The CPU fetches and tries to decode the instruction but, since the instruction is longer than one byte, it tries to fetch the next bytes from the protected page, raising a page fault. The decoder detects the fault and concludes that the instruction is longer than one byte (in our example, the faulty address is 0x20000, the base address of the second page). It repeats the procedure with B2 = 88 b7 and gets the same result. It tries again with B3, B4, B5, and finally tries with six bytes. Since the instruction is six bytes long, the CPU executes the

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:14 L. Martignoni et al.

instruction without accessing the protected memory page. However, the instruction writes into the memory and thus causes a page fault. As in this case, the faulty address (0x78378943) differs from the address of the protected page, our decoder can decide that the instruction is valid and that it is six bytes long. It is worth noting that a sequence of bytes cannot encode, at the same time, a valid instruction and a prefix of a longer instruction. Indeed, such a situation would be ambiguous for the CPU. The third byte sequence in the example of Figure 5(b) is B = f0 00 c0 · · · and represents an invalid instruction. Exactly as before, our decoder executes the first two subsequences B1 and B2 and detects that the instruction is potentially longer because the CPU fetches a third byte from the protected page. When B3 is executed, the CPU does not fetch more bytes but instead raises an illegal instruction exception, testifying that B3 is neither a valid instruction, nor a valid prefix for longer instructions.

4.2.2. Decoding Nonimplicit Operands. Once the decoder finds the length of an instruction the decoder tries to infer the type and the value of the non-implicit operands of the instruction (i.e., the operands that are not implicitly encoded in the opcode of the in- struction). The technique used by our decoder to achieve this goal is an extension of the technique described in the previous paragraphs. Currently, our CPU-assisted decoder is capable of decoding addressing-form specifier operands and immediate operands.

Any Intel ×86 instruction (Figure 1) is composed of an optional prefix, an opcode, and optional operands. To ease the presentation we assume that the instructions have no prefix; in practice, prefixes are detected using a white-list and considered part of the opcode. Given an instruction, encoded by the sequence of bytes B = b1 · · · bn, the format of the operands is detected by performing a series of tests on some instructions derived by changing the bytes of B that follow the opcode and represent the operands of the in- struction. If the opcode is j bytes long, the remaining n− j bytes represent the operands. Each type of operand is encoded using a different encoding: immediate operands (Imm) are encoded as is, addressing-form specifier operands (Addr) are encoded using ModR/M and SIB encoding, and Imm ∪ Addr = Imm ∩ Addr (i.e., an immediate operand does not necessarily represent a valid addressing-form specifier operand, and vice-versa). Therefore, given an instruction encoded by the sequence of bytes B = b1 · · · bn, we ex- pect a new sequence B′ = b1 · · · bj b′j+1 · · · b′m, where b′j+1 · · · b′m represents a new operand of the same type of bj+1 · · · bm, to be valid. Contrarily, we expect another sequence of bytes B = b1 · · · bj bj+1 · · · bm, where bj+1 · · · bm represent an operand of a different type, to be invalid. Therefore, if an instruction with a j bytes long opcode has an immediate operand, then the following holds:

∀b′j+1 . . . b′m ∈ Imm, B′ = b1 . . . bj b′j+1 . . . b′m is valid. In other words, the bytes following the opcode encode an immediate operand if the combination of the opcode with all the possible immediate operands always gives valid instructions. Fortunately, with few tests it is possible to estimate if the previ- ous equation holds. In fact, it is sufficient to verify if it holds for a small number of operands in Imm \ Addr. The same applies for an instruction with an addressing-form specifier operand. Our current prototype of the decoder uses only five tests to de- code addressing-form specifier operands and four to detect 32-bit immediate operands. Basically, in order to infer if an instruction refers to an operand in memory, we use spe- cific configurations of the ModR/M and SIB fields (e.g., [EAX], [EAX]+disp, [EBP]+disp, etc.). Since the opcode can have a variable length (from one to three bytes), our CPU- assisted decoder performs the aforementioned tests with opcodes of incremental length (i.e., j = 1, 2, 3).

Figure 6 shows some of the tests performed by our CPU-assisted instruction decoder to infer the format of the operands of two instructions: the first instruction has an

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:15

addressing-form specifier operand and the second one a 32-bit immediate operand. For the first instruction, the decoder initially assumes that the opcode is one byte long, and performs the analysis of the remaining bytes to detect if they encode an addressing-form specifier operand. To do that it combines the opcode 88 with other valid addressing-form specifier operands of variable length, some of which cannot be interpreted as immediate operands. The first test consists of replacing the alleged operand with a single byte operand and in executing the resulting string. The CPU successfully executes the instruction. The same procedure is repeated with operands of different length (two, three, and seven bytes). All the sequences of bytes are found to encode valid instructions; every execution of the tested instructions raise a page fault exception where the faulty address does not correspond to the base address of the protected page. Therefore, the input instruction is composed of a single byte opcode followed by an addressing-form specifier operand (b7 53 10 fa ca, in 6). The same pro- cedure is applied also to the second instruction. The addressing-form specifier operand decoding fails, so the decoder attempts to verify whether the last four bytes of the instruction encode a 32-bit immediate. All tests performed are passed.

4.3. Test Case Execution

Given a test case, we have to execute it both on the physical and emulated CPUs and then compare their state at the end of the execution. In order to perform such a task we have developed two different applications, the first one denoted by E runs on the emulator and the second one, denoted by P will run on the physical CPU as a user space application. Initially, we start the execution of the test case on the emulator. As soon as the initialization of the state of the emulator is completed, it is replicated to the physical CPU. As registers and memory are initialized with random values, replication is required to guarantee that test cases are executed on the physical and emulated environments starting from the same initial state. Then, the code of the test case is executed in the two environments and, at the end of the execution, we compare the final state. In the remainder of this section, we describe the main steps performed for the execution of a test case and we will also provide details on the strategy we adopted for instrumenting the emulator and the physical environment in order to execute respectively the programs E and P. For simplicity, the details that follow are specific for the testing of process emulators. Nonetheless, the implementation for testing whole- system emulators only requires the addition of introspection capabilities to isolate the execution of the test case program [Garfinkel and Rosenblum 2003].

4.3.1. Executing a Test Case. The execution flow of a test case is summarized in Figure 7 and described in detail in the following paragraphs, where the following notation will be adopted. The state of the emulator CE prior and after the execution of a test case respectively sE = ( pcE, RE, ME, EE) and s′E = ( pc′E, R′E, M′E, E′E). Similarly, for CP , we use respectively sP = ( pcP , RP , MP , EP ) and s′P = ( pc′P , R′P , M′P , E′P ).

Setup of the Emulated Execution Environment. The CPU emulator is started and it begins to execute the program E generating and executing the test case (LE1) until the state of the environment is completely initialized (LE2). In other words, E is executed without interference until the execution reaches pcE, that is, the address of the code of the test case (see line 19, Figure 3). E initializes the emulator memory by mapping a file filled with random data. For simplicity, the same file is mapped multiple times at consecutive addresses until the entire user-portion of the address space is allocated. To avoid a useless waste of memory, the file is lazily mapped in memory, such that physical memory pages are allocated only if they are accessed. As we discussed in Section 4.1.1, CPU registers are also initialized with random values.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:16 L. Martignoni et al.

Fig. 6. Decoding of nonimplicit operands using our CPU-assisted instruction decoder: instructions with (a) addressing-form specifier operand and (b) immediate operand.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:17

Fig. 7. Logic of the execution of a test case (t.c., for short). denotes the execution of the test case and denotes the execution of the code of the logic.

Setup of the Physical Execution Environment. When the state of the emulated envi- ronment has been set up (i.e., when the execution has reached pcE), the initial state, sE = ( pcE, RE, ME, EE), can be replicated into the physical environment. The emulator notifies and transfers the state of the CPU registers to P (LE3). Initially, the exception state EE is always assumed to be ⊥. Note that the memory state of the physical CPU MP is not synchronized with the emulated CPU. At the beginning, only the memory page containing the code of the test case is copied into the physical environment (LP1 and LE4). The remaining memory pages are instead synchronized on-demand the first time they are accessed, as it will be explained in detail in the next paragraph. At this point we have that RE = RP , EE = EP = ⊥, but ME = MP (the only page that is synchronized is the one with the code).

Test Case Execution on the Physical CPU. The execution of the code of the test case on the physical CPU starts, beginning from program address pcP = pcE (LP3). P besides an initialization routine, to set up the execution environment, also contains a finalization routine, to save the content of the registers; moreover, test cases instructions are patched to avoid unwanted control transfers. For further details, see Section 4.3.3. During the execution of the code, the following situations may occur.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:18 L. Martignoni et al.

(I) execution of the code of the test case terminates; (II) a page-fault exception caused by an access to a missing page occurs;

(III) a page-fault exception caused by a write access to a non-writable page occurs; (IV) any other exception occurs.

Situation (I) indicates that the entire code of the test case is executed successfully. That means that the instruction in the test case was valid and did not generate any fatal CPU exception. The first type of page-fault exceptions (II) allows us to synchronize lazily the memory containing the data of the test case at the first access. During the initialization phase (LP2) all the memory pages of the physical environment, but that containing the code (and few others containing the code to run the logic), are protected to prevent any access. Consequently, if an instruction of the test case tries to access the memory, we intercept the access through the page fault exception and we retrieve the entire memory page from the emulated environment (LP4 and LE5). All data pages retrieved are initially marked as read-only to catch future write accesses. After that, the execution of the code of the test case on the physical CPU is resumed (LP5). The second type of page-fault exceptions (III) allows us to intercept write accesses to the memory. Written pages are the only pages that can differ from one environment to the other. Therefore, after a faulty write operation we flag the memory page as written. Then, the page is marked as writable and the execution is resumed (LP6 and LP7). Obviously, depending on the code of the test case, situations (II) and (III) may occur repeatedly or may not occur at all during the analysis. Finally, the occurrence of any other exception (IV) indicates that the execution of the code of the test case cannot be completed because the CPU is unable to execute an instruction. When the execution of the code of the test case on the physical CPU terminates, because of (I) or (IV), P regains the control of the execution, immediately saves the state of the environment for future comparisons (LP8), and restores the state of the CPU prior to the execution of the test case.

Test Case Execution on the Emulated CPU. The execution of the code of the test case in the emulated environment, previously stopped at pcE (LE2), can now be safely resumed. The execution of the code in the emulated environment must follow the exe- cution in the physical environment and cannot be concurrent with it. This is because in the physical environment the state of the memory is synchronized on-demand and thus the initial state of the memory ME must remain untouched until the physical CPU com- pletes the execution of the test case. When this happens the execution is resumed and it terminates when all the code of the test case is executed or an exception occurs (LE6).

Comparison of the Final State. When the emulator and the physical environments have completed the execution of the test case, we can compare their state (s′E = ( pc′E, R′E, M

′ E, E

′ E) and s

′ P = ( pc′P , R′P , M′P , E′P )). The comparison is performed by P. The

emulator notifies P and then transfers the program counter pc′E, the current state of the CPU registers R′E, and the exception state E

′ P (LE7). To compare s

′ E and s

′ P , it is not

necessary to compare the entire address space: P fetches only the contents of the pages that have been marked as written (LP10 and LE8). At this point s′E is compared with s

′ P

(LP11). If s′E differs from s ′ P , we record the test case and the difference(s) produced.

4.3.2. Embedding the Logic in the CPU Emulator. Program E is run directly in the emulator under analysis. The emulator is extended to include the code of E. We embed the code leveraging the instrumentation API provided by the majority of the emulators. The main functionalities of the embedded code are the following. First, it allows to intercept the beginning and the end of the execution of each instruction (or basic block, depending on the emulator) of the emulated program. If the code of the test case contains multiple instructions, all basic blocks (or instructions) are intercepted and contribute to the testing. We assume the code used to initialize the environment

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:19

is always correctly emulated and thus we do not test it nor we intercept its execution. Second, the embedded code allows to intercept the exceptions that may occur during the execution of the test case. Third, it provides an interface to access the values of the registers of the CPU and the contents of the memory of the emulator.

4.3.3. Running the Logic on the Physical CPU. On the physical CPU, the test case is run through a user-space program that implements the various steps described in 4.3.1. An initialization routine (LP2 in Figure 7), is used to set up the registers of the CPU, to register signal handlers to catch page faults and the other run-time exceptions that can arise during the execution of the test case, and to transfer the control to the code of the test case. The code of the test case is executed as a shellcode [Koziol et al. 2004] and consequently we must be sure it does not contain any dangerous control transfer instruction that would prevent us from regain the control of the execution (e.g., jumps, function calls, system calls). Given the approaches we use to generate the code of the test cases, we cannot prevent the generation of such dangerous test cases. Therefore, we rely on a traditional disassembler to analyze the code of the test case, identify dangerous control transfer instructions, and patch the code to regain the control of the execution (e.g., by modifying the target address of direct jump instructions).6 To prevent endless loops caused by failures of this analysis, we put a limit on the maximum CPU time available for the execution of a test case and we interrupt the execution if the limit is exceeded. In the current implementation, this limit is set to 5s, and has been determined experimentally to guarantee detection of endless loops. At the end of the code of the test case, we append a finalization routine (LP8 in Figure 7), that is used to save the contents of the registers for future comparison, to restore their original contents, and to resume the normal execution of the remaining steps of the logic. Exceptions other than page-faults interrupt the execution of the test case. The handlers of these exceptions record the exception occurred and overwrite the faulty instruction and the following ones with nops, to allow the execution to reach the finalization routine to save the final state of the environment.

In the approach just described, the program P and the test case share the same address space. Therefore, the state of the memory in the physical environment differs slightly from the state of the memory in the emulated environment: some memory pages are used to store the code and the data of the user-space program, through which we run the test case. If the code of the test case accesses any of these pages, we would notice a spurious difference in the state of the two environments. Considering that the occurrence of such event is highly improbable, we decided to neglect this problem, to avoid complicating the implementation.

5. EVALUATION

This section presents the results we obtained by testing five IA-32 emulators with EmuFuzzer: three process emulators (QEMU, Valgrind, and Pin) and two system emulator (BOCHS and JPC). Specifically, we chose QEMU and BOCHS because they are the most widely used IA-32 emulators, Valgrind and Pin because, despite them being dynamic instrumentation tools, their internal architecture resembles the architecture of a traditional emulator, and finally JPC because it is an IA-32 emulator fully developed in Java language and therefore portable on several platforms and devices (e.g., mobile devices).

We generated a large number of test cases, evaluated their quality, and fed them to the five emulators. None of the emulators tested turned out to be faithful. In each

6If the disassembler failed to detect dangerous control transfer instructions, we could not be able to regain the control of the execution properly.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:20 L. Martignoni et al.

of them, we found different classes of defects: small deviations in the contents of the status register after arithmetical and logical operations, improper exception raising, incorrect decoding of instructions, and even crash of the emulator. Our experimental results lead to the following conclusions: (I) developing a CPU emulator is actually very challenging, (II) developers of these software would highly benefit from specialized testing methodology, and (III) EmuFuzzer proved to be a very effective tool for testing CPU emulators.

5.1. A Glimpse at the Implementation

The current EmuFuzzer implementation consists of three interconnected components: a coordinator and two drivers (one for the physical CPU, and one for the emulator under analysis). The coordinator supervises the execution of a test case. The driver that controls the execution on the physical CPU is independent from any specific emulator. On the contrary, the driver for CPU emulator augments a specific emulator with the features needed to intercept the execution of a single instruction and to inspect the execution state; this driver is obviously emulator-specific, and we implemented a different emulator driver for each CPU emulator we considered in our experiments.

For a given test case, s, the coordinator first leverages the emulator driver to set up the emulated execution environment: CPU registers and memory locations are initialized as specified by s. Subsequently, as just mentioned in Section 4.3, the coordinator starts executing the test case on the physical processor. Such an execution may require the setting of some CPU registers or memory locations, that are thus fetched from the emulated environment and replicated into the physical one. Once the execution of the test case completes, the processor final state δCP (s) is dumped to a file. At this point the coordinator starts the execution of s on the emulator. Also in this case the final state of the computation, δCE (s) will be dumped to a file. Then, the coordinator compares δCP (s) and δCE (s).

The coordinator is written in Python (∼750 noncomment lines of code), while the CPU driver consists of roughly 1200 noncomment lines of C++ code. Finally, emulator drivers are written in the same language of the target emulator, typically C. On average, an emulator driver requires about 450 noncomment lines of code (of these, 350 lines are emulator-independent and are shared between all emulator drivers). Communication between the coordinator and the drivers relies on the XML-RPC protocol.

5.2. Experimental Setup

We performed the evaluation of our testing methodology and tool using an Intel Pentium 4 (3.0 GHz), running Debian GNU/Linux with kernel 2.6.26, as baseline physical CPU. The physical CPU supported the following features: MMX, SSE, SSE2, and SSE3. We tested the following release of each emulator, namely: QEMU 0.9.1, Valgrind 3.3.1, Pin 2.5-23100, JPC 2.4, and BOCHS 2.3.7. The features of the physical machine were compatible with the features of the tested emulators with few exceptions, which we identified at the end of the testing, using a traditional disassembler, and ignored (e.g., BOCHS also supports SSE4).

5.3. Evaluation of Test Case Generation

We generated about 3 million test cases, 70% of which using our CPU-assisted algo- rithm and the remaining 30% randomly. We empirically estimated the completeness of the set of instructions covered by the generated test cases by disassembling the code of the test cases, by counting the number of different instructions found (operands were ignored), and by comparing this number with the total number of mnemonic instructions recognized by the disassembler. The randomly generated test cases cov- ered about 75% of the total number of instructions, while the test cases generated using

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:21

Table II. Results of the Evaluation: Number of Distinct Mnemonic Opcodes (OP) and Number of Test Cases (TC) that Triggered Deviations in the Behavior between the Tested Emulators and the Baseline Physical CPU

Deviation type QEMU Valgrind Pin BOCHS JPC

OP TC OP TC OP TC OP TC OP TC

R CPU flags 39 1362 13 684 22 2180 2 2686 33 4088 CPU general 3 142 8 141 3 18 8 8 27 657 FPU 179 41738 157 39473 0 0 71 1631 185 43024

M memory state 34 1586 10 420 0 0 1 2 46 2122

E not supported 2 1120 334 11513 2 12 0 0 8 1998 over supported 97 1859 10 716 0 0 5 8 124 1930 other 126 6069 41 6184 20 34 45 113 132 5935 Total 405 53926 529 59135 43 2245 130 4469 482 59354

our CPU-assisted algorithm covered about 62%. Overall, about 81% of the instructions supported by the disassembler were included in the test cases used for the evaluation. It is worth noting that in several cases our test cases contained valid instructions not recognized by the disassembler.

The implementation of our CPU-assisted algorithm is not complete and lacks support for all instructions with prefixes. For example, currently our algorithm does not gener- ate test cases involving instructions operating on 16-bits operands. We have empirically estimated that instructions with prefixes represent more than 25% of the instructions space. Therefore, a complete implementation of the algorithm would allow to achieve a nearly total coverage. We speculate that the high coverage of randomly generated test cases is due to the fact that the IA-32 instruction set is very dense and consequently a random bytes stream can be interpreted as a series of valid instructions with high probability. Nevertheless, during our empirical evaluation we reached a local optimum from which it was impossible to move away, even after having generated hundreds of thousands of new test cases. The CPU-assisted algorithm instead does not suffer this kind of problem: a complete implementation would allow to generate a finite number of test cases exercising all instructions in multiple corner cases.

5.4. Testing of IA-32 Emulators

The five CPU emulators were tested using a small subset (∼10%) of the generated test cases, selected randomly. The whole testing took about a day for all the emulators but JPC, at the speed of around 15 test cases per second (JPC alone took several days to run all the test cases). Table II reports the results of our experiments. Behavioral differences found are grouped into three categories: CPU registers state ( R), memory state (M), and exception state (E). Differences in the state of the registers are further separated according to the type of the registers: status (CPU flags), general purpose and segment (CPU general), and floating-point (FPU). Differences in the exception state are separated in: legal instructions not supported by the emulator (not supported), illegal instructions valid for the emulator (over supported), and other deviations in the exception state (other). As an example, the last class includes instructions that expect aligned operands but execute without any exception even if the constraint is not satisfied. For each emulator and type of deviation, the table reports the number of distinct mnemonic opcodes leading to the identification of that particular type of deviation (opcodes) and the number of test cases proving the deviation (test cases). It is worth pointing out that different combinations of prefixes and opcodes are considered as different mnemonic opcodes. For each distinct opcode that produced a particular type of deviation, we verified and confirmed manually the correctness of at least one of the results found.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:22 L. Martignoni et al.

The results demonstrate the effectiveness of the proposed testing methodology. For each emulator we found several mnemonic opcodes not faithfully emulated: 405 in QEMU, 529 in Valgrind, 43 in Pin, 130 in BOCHS and 482 in JPC. It is worth noting that some of the deviations found might be caused by too lax specifications of the physical CPU. For example, the manufacturer documentation of the add instruction precisely states the effect of the instruction on the status register, while the documentation of the and instruction only states the effect on some bits of the status register, while leaving undefined the value the remaining bits [Intel 2008]. Our reference of the specification is the CPU itself and consequently, with respect to our definition of faithful emulation, any deviation has to be considered a tangible defect. Indeed, for each deviation discovered by EmuFuzzer it is possible to write a program that executes correctly in the physical CPU, but crashes in the emulated CPU (or vice-versa). We manually transformed some of the problematic test cases into this kind of programs and verified the correctness of our claim. The remarkable number of defects found also witnesses the difficulty of developing a fully featured and specification-compliant CPU emulator and motivates our conviction about the need of a proper testing methodology.

The following paragraphs summarize the defects we found in each emulator. The description is very brief because the intent is not criticize the implementation of the tested emulators, but just to show the strength of EmuFuzzer at detecting various classes of defects.

In Paleari et al. [2011], we release all the improper behaviors we detected in the emulators supported by EmuFuzzer. Developers were informed about the defects found in their emulators, providing them with the corresponding test cases.

5.4.1. QEMU. A number of arithmetical and logical instructions are not properly exe- cuted by the emulator because of an error in the routine responsible for decoding certain encoding of memory operands (e.g., or %edi, 0x67(%ebx) encoded as 08 7c e3 67); the instructions reference the wrong memory locations and thus compute the wrong re- sults. The emulator accepts several illegal combinations of prefixes and opcodes and executes the instruction ignoring the prefixes (e.g., lock fcos). Floating-point instruc- tions that require properly aligned memory operands are executed without raising any exception even when the operands are not aligned, because the decoding routine does not perform alignment checking (e.g., fxsave 0x00012345). Segments registers, which are 16 bits wide, are emulated as 32-bit registers (the unused bits are set to zero), thus producing deviations when they are stored in other 32-bits registers and in memory (e.g., push %fs). Some arithmetic and logical instructions do not faithfully update the status register. Finally, we found sequences of bytes that freeze and others that crash the emulator (e.g., xgetbv).

5.4.2. Valgrind. Some instructions have multiple equivalent encodings (i.e., two dif- ferent opcodes encode the same instruction) but the emulator does not recognize all the encodings and thus the instructions are considered illegal (e.g., addb $0x47, %ah with opcode 82). Several legal privileged instructions, when invoked with insufficient privileges, do not raise the appropriate exceptions (e.g., mov (%ecx), %cr3 raises an illegal operation exception instead of a general protection fault). On the physical CPU, each instruction is executed atomically and, consequently, when an exception occurs the state of the memory and of the registers correspond to the state preceding the execution of the instruction. On Valgrind instead, instructions are not executed atomi- cally because they are translated into several intermediate instructions. Consequently, if an exception occurs in the middle of the execution of an instruction, the state of the memory and of the registers might differ from the state prior to the execution of the instruction (e.g., idiv (%ecx) when the divisor is zero). As in QEMU, some logical instructions do not faithfully update the status register.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:23

5.4.3. Pin. Not all exceptions are properly handled (i.e., trap and illegal instruction exceptions); Pin does not notify the emulated program about these exceptions. Sev- eral legal instructions that raise a general protection fault on the physical CPU are executed without generating any exception on Pin (e.g., add %ah, %fs:(%ebx)). When segment registers are stored (and removed) in the stack, the stack pointer is not up- dated properly: a double-word should be reserved on the stack for these registers, but Pin reserves a single word (e.g., push %fs). The FPU appears to be virtualized (i.e., the floating-point code is executed directly on the physical FPU) and, as expected, no deviation is detected in the execution of FPU instructions. As in Valgrind and QEMU, some logical instructions do not faithfully update the status register.

5.4.4. BOCHS. Certain floating-point instructions alter the state of some registers of the FPU and other instructions compute results that differ from those computed by the FPU of the physical CPU (e.g., fadd %st0, %st7). If an exception occurs in the middle of the execution of an instruction manipulating the stack, the initial contents of the stack pointer corresponds to that we would have if the instruction were successfully executed (e.g., pop 0xffffffff). Some instructions do not raise the proper exception (e.g., int1 raises a general protection fault instead of a trap exception). As in Valgrind, QEMU, and Pin, some logical instruction do not faithfully update the status register, although the number of such instruction is smaller than the number of instructions affected by this problem in the other emulators.

5.4.5. JPC. Conversely to the other emulators, which are written in C and C++, JPC is fully developed in Java. It turned out that one of the main problems we had to deal with for testing this emulator was its poor performances executing test cases: JPC is approximately 75% slower than any other emulator. A description of the main deviations found follows. Segment registers, which are 16-bits wide, are emulated as 32-bit registers. This implies deviations when segment registers are stored in other 32-bits registers and in memory (e.g., push %gs). Several legal privileged instructions, when invoked with insufficient privileges, do not raise the appropriate exceptions (e.g., mov (%ecx), %cr0 and mov %cr3, %eax are correctly executed when the CPU is in user mode without raising a general protection fault exception). As in QEMU, the emulator accepts illegal combination of prefixes and executes the instruction ignoring them (e.g., lock fcos). Moreover, not all exceptions are properly handled (i.e., illegal instruction exceptions) and some instructions do not raise the appropriate exception. To conclude, we found several sequences of bytes that crash the emulator (e.g., bound %eax, (%ebx) and int1).

6. DISCUSSION

EmuFuzzer currently works in user-space and thus it can only verify whether un- privileged code is not emulated faithfully, with few exceptions. For example, some unprivileged instructions that access segment registers might not be tested because it is not possible to manipulate properly the value of these registers from user-space. Fortunately, in many cases the values of the segment registers in the emulated and in the physical environments do not need to be manipulated as they already match. Another limitation is that, from user-space, we cannot manipulate control registers and thus we cannot enable supplementary CPU-enforced alignment checking and the other enforcements it offers, which are disabled by default. In the future we plan to port the component running in the physical environment of EmuFuzzer in kernel-space, to be able to perform a more thorough testing. Furthermore, we plan to test new CPU emulators and also to use EmuFuzzer to test the emulation routines adopted by virtual machines to emulate non-virtualizable instructions (i.e., privileged instructions that do

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:24 L. Martignoni et al.

not cause a trap if executed with insufficient privileges [Robin and Irvine 2000]). As an example, VirtualBox [Sun Microsystem 2011] leverages QEMU code for that purpose.

7. RELATED WORK

7.1. Software Testing

Fuzz-testing has been introduced by Miller et al. [1990], and it is still widely used for testing different types of applications. Originally, fuzz-testing consisted of feeding applications purely random input data and detecting which inputs were able to crash an application, or to cause unexpected behaviors. Today, this testing methodology is used to test many different types of applications; for example, GUI applications, web applications, scripts, and kernel drivers [DeMott 2006].

As certain applications require inputs with particular format (e.g., a XML document or a well-formed Java program), pure randomly generated inputs cannot guarantee a reasonable coverage of the code of the application under analysis. Recently devel- oped testing techniques typically leverage domain specific knowledge and use this knowledge, optionally in tandem with a random component, to drive inputs genera- tion [Daniel et al. 2007; Kaksonen 2001; Sutton et al. 2007]. An alternative approach to improve the completeness of the testing consists of building constraints that describe what properties are required for the input to trigger the execution of particular pro- gram paths, and in using a constraint solver to find inputs with these properties [Cadar et al. 2006; Godefroid et al. 2008; Sen et al. 2005; Majumdar and Sen 2007; Xu et al. 2010; Santelices and Harrold 2010]. This article presents a fuzz-testing methodology specific for CPU emulators that leverages both pure random inputs generation and domain knowledge to improve the completeness of the analysis.

In our previous works, we explored the idea of using mechanically generated tests and to compare the behavior of two components to detect deviations imputable to bugs [Paleari et al. 2010; Martignoni et al. 2009, 2010]. This approach is known in literature as differential testing [McKeeman 1998; Lahiri et al. 2010; Person et al. 2008; Taneja et al. 2011]. EmuFuzzer adopts differential testing to detect if the tested CPU emulator behaves unfaithfully with respect to the physical CPU emulated.

7.2. Computer Security

CPU emulators are widely used in computer security for various purposes. One of the most common applications is malware analysis [Bayer et al. 2006; Martignoni et al. 2008]. Emulators allow fine-grained monitoring of the execution of a suspicious programs and to infer high-level behaviors. Furthermore, they allow to isolate the execution and to easily checkpoint and restore the state of the environment. Malware authors, aware of the techniques used to analyze malware, aim at defeating those techniques such that their software can survive longer. To defeat dynamic behavioral analysis based on emulators, they typically introduce in malware routines able to detect if a program is run in an emulated or in a physical environment. As the average user targeted by the malware does not use emulators, the presence of an emulated environment likely indicates that the program is being analyzed. Thus, if the malicious program detects the presence of an emulator, it starts to behave innocuously such that the analysis does not detect any malicious behavior. Several researchers have analyzed state-of-the-art emulators to find unfaithful behaviors that could be used to write specific detection routines [Ormandy 2007; Raffetseder et al. 2007; Rutkowska 2004; Oberheide and Miller 2012]. Unfortunately for them, their results were obtained through a manual scrutiny of the source code or rudimentary fuzzers, and thus the results are largely incomplete. The testing technique presented in this paper can be used to find automatically a large class of the unfaithful behaviors that a miscreant

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

A Methodology for Testing CPU Emulators 29:25

could use to detect the presence of an emulated CPU. This information could then be used to harden an emulator, to the point that it satisfies the requirements for undetectability identified by Dinaburg et al. [2008].

8. CONCLUSIONS

CPU emulators are complex pieces of software. In this article, we presented a testing methodology for CPU emulators, based on fuzzing. Emulators are tested by generating test case programs and by executing them on the emulated and on the physical CPUs. As the physical CPU is assumed to follow perfectly the specification, defects in the emulators can be detected by comparing the state of the emulator with that of the physical CPU, after the execution of the test case program. The proposed methodology has been implemented in a prototype, named as EmuFuzzer, and it was used to test five state-of-the-art IA-32 CPU emulators. EmuFuzzer discovered minor and major defects in each of the tested emulators, thus demonstrating the effectiveness of the proposed approach.

REFERENCES

BAYER, U., KRUEGEL, C., AND KIRDA, E. 2006. TTAnalyze: A tool for analyzing malware. In Procedings of the 15th European Institute for Computer Antivirus Research Annual Conference (EICAR’06).

BELLARD, F. 2005. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association.

CADAR, C., GANESH, V., PAWLOWSKI, P. M., DILL, D. L., AND ENGLER, D. R. 2006. EXE: Automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security. ACM.

DANIEL, B., DIG, D., GARCIA, K., AND MARINOV, D. 2007. Automated testing of refactoring engines. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM.

DEMOTT, J. 2006. The evolving art of fuzzing. Def. Con 14. DINABURG, A., ROYAL, P., SHARIF, M., AND LEE, W. 2008. Ether: Malware analysis via hardware virtualization

extensions. In Proceedings of the 15th ACM conference on Computer and Communications Security. ACM. FERRIE, P. 2006. Attacks on virtual machine emulators. Tech. rep., Symantec Advanced Threat Research. GARFINKEL, T. AND ROSENBLUM, M. 2003. A virtual machine introspection based architecture for intrusion

detection. In Proceedings of Network and Distributed Systems Security Symposium (NDSS). The Internet Society, San Diego, CA.

GODEFROID, P., LEVIN, M. Y., AND MOLNAR, D. 2008. Automated whitebox fuzz testing. In Proceedings of the Network and Distributed System Security Symposium (NDSS). The Internet Society, San Diego, CA.

GOOGLE INC. 2011. Android emulator. http://code.google.com/android/reference/emulator.html. INTEL. 2008. Intel 64 and IA-32 Architectures Software Developer’s Manual. Intel. Instruction Set Reference. KAKSONEN, R. 2001. A functional method for assessing protocol implementation security. Tech. rep., VTT

Electronics. KOZIOL, J., LITCHFIELD, D., AITEL, D., ANLEY, C., EREN, S., MEHTA, N., AND HASSELL, R. 2004. The Shellcoder’s

Handbook: Discovering and Exploiting Security Holes. Wiley. LAHIRI, S. K., VASWANI, K., AND HOARE, C. A. R. 2010. Differential static analysis: opportunities, applications,

and challenges. In Proceedings of the Workshop on Future of Software Engineering Research (FoSER’10), ACM, 201–204.

LAWTON, K. P. 1996. Bochs: A portable PC emulator for unix/x. Linux J. 1996, 29es. LICHSTEIN, H. A. 1969. When should you emulate? Datamation 11, 205–210. LUK, C., COHN, R., MUTH, R., PATIL, H., KLAUSER, A., LOWNEY, G., WALLACE, S., REDDI, V. J., AND HAZELWOOD, K.

2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM.

MAGNUSSON, P. S., CHRISTENSSON, M., ESKILSON, J., FORSGREN, D., HALLBERG, G., HÖGBERG, J., LARSSON, F., MOESTEDT, A., AND WERNER, B. 2002. Simics: A full system simulation platform. Computer 35, 50–58.

MAJUMDAR, R. AND SEN, K. 2007. Hybrid concolic testing. In Proceedings of the 29th International Conference on Software Engineering (ICSE). IEEE Computer Society.

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.

29:26 L. Martignoni et al.

MARTIGNONI, L., PALEARI, R., FRESI ROGLIA, G., AND BRUSCHI, D. 2009. Testing CPU emulators. In Proceedings of the International Conference on Software Testing and Analysis (ISSTA). ACM, 261–272.

MARTIGNONI, L., PALEARI, R., FRESI ROGLIA, G., AND BRUSCHI, D. 2010. Testing system virtual machines. In Proceedings of the International Symposium on Testing and Analysis (ISSTA).

MARTIGNONI, L., STINSON, E., FREDRIKSON, M., JHA, S., AND MITCHELL, J. C. 2008. A layered architecture for detecting malicious behaviors. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID). Lecture Notes in Computer Science, Springer, Berlin.

MCKEEMAN, W. M. 1998. Differential testing for software. Digital Tech. J. 10, 1, 100–107. MILLER, B. P., FREDRIKSON, L., AND SO, B. 1990. An empirical study of the reliability of UNIX utilities. Comm.

ACM 33, 12. MYERS, G. J. 1978. The Art of Software Testing. Wiley. netbsd64 2011. NetBSD/amd64. http://www.netbsd.org/ports/amd64/. NETHERCOTE, N. 2004. Dynamic binary analysis and instrumentation. Ph.D. thesis, Computer Laboratory,

University of Cambridge, UK. OBERHEIDE, J. AND MILLER, C. 2012. Dissecting the Android bouncer. SummerCon, Brooklyn, NY. http://jon.

oberheide.org/files/summercon12-bouncer.pdf. ORMANDY, T. 2007. An empirical study into the security exposure to host of hostile virtualized environments.

In Proceedings of CanSecWest Applied Security Conference. PALEARI, R., MARTIGNONI, L., FRESI ROGLIA, G., AND BRUSCHI, D. 2010. N-version disassembly: Differential

testing of ×86 disassemblers. In Proceedings of the International Symposium on Testing and Analysis (ISSTA). ACM.

PALEARI, R., MARTIGNONI, L., REINA, A., FRESI ROGLIA, G., AND BRUSCHI, D. 2011. EmuFuzzer Red-Pills Archive. http://security.di.unimi.it/emufuzzer.html.

PERSON, S., DWYER, M. B., ELBAUM, S., AND PǍSǍREANU, C. S. 2008. Differential symbolic execution. In Proceed- ings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 226–237.

PRESTON, I., NEWMAN, R., AND TSENG, J. 2007. JPC: The pure Java ×86 PC emulator. http://jpc.sourceforge. net/home home.html.

QUIST, D. AND SMITH, V. 2006. Detecting the presence of virtual machines using the local data table. Offensive Computing. http://www.offensivecomputing.net/files/active/0.vm.pdf.

RAFFETSEDER, T., KRUEGEL, C., AND KIRDA, E. 2007. Detecting system emulators. In Proceedings of the Information Security Conference (ISC’07). Springer-Verlag.

ROBIN, J. S. AND IRVINE, C. E. 2000. Analysis of the intel pentium’s ability to support a secure virtual machine monitor. In Proceedings of the 9th Conference on USENIX Security Symposium (SSYMM’00). USENIX Association.

RUTKOWSKA, J. 2004. Red Pill. . . or how to detect VMM using (almost) one CPU instruction. http://www.ouah. org/Red %20Pill.html.

SANTELICES, R. A. AND HARROLD, M. J. 2010. Exploiting program dependencies for scalable multiple-path symbolic execution. In Procedings of the International Symposium on Software Testing and Analysis (ISSTA’10). ACM, 195–206.

SEN, K., MARINOV, D., AND AGHA, G. 2005. CUTE: A concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference. ACM.

SUN MICROSYSTEM. 2011. VirtualBox. http://www.virtualbox.org. SUTTON, M., GREENE, A., AND AMINI, P. 2007. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley

Professional. TANEJA, K., XIE, T., TILLMANN, N., DE HALLEUX, J., AND DE HALLEUX, P. 2011. eXpress: Guided path exploration

for efficient regression test generation. In Procedings of the 2011 International Symposium on Software Testing and Analysis (ISSTA’11). ACM.

XU, Z., KIM, Y., KIM, M., ROTHERMEL, G., AND COHEN, M. B. 2010. Directed test suite augmentation: techniques and tradeoffs. In Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 257–266.

Received October 2010; revised April 2012 and July 2012; accepted July 2012

ACM Transactions on Software Engineering and Methodology, Vol. 22, No. 4, Article 29, Pub. date: October 2013.