
46 February 2005 QUEUE rants: [email protected]

Composing a score can help us manage the complexity of testing distributed apps.

MICHAEL DONAT, SILICON CHALK

QUEUE February 2005 47 more queue: www.acmqueue.com

Networking and the Internet are encouraging increasing levels of interaction and collaboration between people and their software. Whether users are playing games or composing legal documents, their applications need to manage the complex interleaving of actions from multiple machines over potentially unreliable connections. As an example, Silicon Chalk is a distributed application designed to enhance the in-class experience of instructors and students. Its distributed nature requires that we test with multiple machines. Manual testing is too tedious, expensive, and inconsistent to be effective. While automating our testing, however, we have found it very labor intensive to maintain a set of scripts describing each machine’s portion of a given test. Maintainability suffers because the test description is spread over several files.

Experience has convinced me that testing distributed software requires an automated test lab, where each test description specifies behavior over multiple test-lab components.

In this article, I address the need for a centralized test description, called a score. I describe the syntax of a score as a means of discussing the issues I have encountered and some methods for addressing them. These ideas are based on our efforts at Silicon Chalk to centralize the automation of as much of our testing as possible. Although other applications may require different test-lab infrastructures, I outline what I believe to be the core issues.

Orchestrating an Automated Test Lab

FOCUS: Quality Assurance


THE CONGREGATION

At Silicon Chalk we need a large test lab. This is only partly because we need to test our application with a variety of hardware. Our software is specifically designed for in-class use, where many students interact with an instructor. Classes may contain 20 to 200 laptops connected through a wired or wireless network. Silicon Chalk supports collaboration, communication, exercises, note taking, and presentation in face-to-face classes where some or all students have laptops, desktops, or tablet computers. There are many ways for actions on one participant’s machine to have effects on all the others. The collective dynamic behavior of multiple instances of Silicon Chalk, all communicating with each other, requires that it be tested in a multiple-instance environment, rather than testing individual instances.

This is a fundamental difference between distributed applications and more traditional solitary applications such as Excel. For example, two users working with Excel simultaneously need not know of the other’s existence. The only impact they may have on one another is through contention for an external resource (e.g., a spreadsheet). Such contention can usually be resolved by arbitrarily assigning the resource until it is free to be passed on. In contrast, multiple instances of Silicon Chalk create a congregation where each instance has a vested interest in putting information on the network with coordinated care.

In a Silicon Chalk session, information predominantly flows from instructor to students. For example, the instructor presents a set of slides, and they appear, one by one, on the students’ displays. At other times during the session, information may flow from the students back to the instructor (e.g., the students’ answers to a quiz). So at any given time, there may be many machines trying to put data onto the network.

In particular, a wireless network is quite fragile and can start to leak data as it nears capacity. As the members of the congregation send data on the network, they must pay attention to how much data is leaking. Note that Silicon Chalk is an example of a distributed application where some resources (e.g., the wireless network) must be managed as a congregation rather than individually.

One of the most important types of tests we perform is the Group Test, which simulates a typical session on a collection of many machines usually connected through a wireless network. We simulate various session activities and measure network utilization to verify performance. We conduct these tests on different numbers of machines and network configurations (e.g., 802.11b only, b+g, g only, single access points, multiple access points, etc.).

The variety of wireless network equipment may or may not have an impact on the network’s performance (from Silicon Chalk’s point of view). Different network hardware has the potential to deliver data at differing rates. Thus, different instances will have different data transfer characteristics, which results in different behavior. These interactions must be investigated to ensure that a campus with a diverse range of student laptops will be able to conduct satisfactory Silicon Chalk sessions.

Because Silicon Chalk is an interaction of many machines in parallel, testing becomes complex. Each configuration needs its chance as instructor and student. The interaction of Silicon Chalk tools during a session also increases the number of potential cases to be tested.

AUTOMATION

Getting through all of this testing manually is a horrendous job. Automation is the only viable option. It is important to realize that automation is not limited to the execution of the application being tested. There are several aspects of testing to consider, which are applicable to distributed applications in general:

Component configuration. The test lab contains a variety of equipment, including computers and the network equipment used for the test. The components need to be configured before the test begins. This includes network settings and operating system/software images used on the computers.

Build installation. Most modern software development projects have automated build systems. Once the build has completed, automated testing makes sense. Before this can happen, the latest successful build has to be installed on all computers in the test lab.

Test execution on a given component. I associate actions with components rather than with the application, because Silicon Chalk also interacts with other applications. As such, actions might refer to other applications and are not limited to Silicon Chalk. The instructor might present another application to the students, or the Classroom Management Tool on a student’s machine might report open applications to the instructor. For this reason, the actions to be performed on a machine might include opening and driving other applications before, and after, Silicon Chalk has been started. An agent on a machine is responsible for processing the script for that machine.

In addition, some actions may occur on hardware such as access points. Scripting an access point would, for example, let us verify that Silicon Chalk degrades gracefully when the wireless network degrades.

Synchronization of actions across a number of machines. For repeatability, the actions performed on different machines must be synchronized. For example, students cannot begin answering a quiz before they have received it.

Post-test log file analysis. Once the test has ended, data needs to be collected and analyzed to determine the success of the test. It is possible that performance issues exist, even though the session appeared to proceed normally.

Build installation and post-test analysis will typically be the same for each test, so the specification for a particular test would need to include a component configuration together with a set of actions and how those actions are synchronized. For the remainder of this article, I focus on specifying synchronized actions.

One natural approach to scripting synchronized actions is to write one script for each machine and encode synchronization points into each script. A synchronization point is like a gate in the script. Actions scripted after the synchronization point can be processed only after all agents have reached that point.
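The gate behavior of a synchronization point can be sketched in Python with `threading.Barrier`. This is only an illustration: the agents are simulated as threads in one process, and the action names (`send_quiz`, `join_course`, and so on) are hypothetical, not Silicon Chalk script commands.

```python
import threading

def run_agents(scripts):
    """Run one thread per simulated agent.

    scripts maps an agent name to (actions_before_gate, actions_after_gate).
    Returns the order in which actions were performed.
    """
    barrier = threading.Barrier(len(scripts))  # the synchronization point
    log = []
    lock = threading.Lock()

    def agent(name, before, after):
        for action in before:
            with lock:
                log.append((name, action))
        barrier.wait()  # gate: blocks until every agent has arrived
        for action in after:
            with lock:
                log.append((name, action))

    threads = [
        threading.Thread(target=agent, args=(name, before, after))
        for name, (before, after) in scripts.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

scripts = {
    "instructor": (["send_quiz"], ["collect_answers"]),
    "student1": (["join_course"], ["answer_quiz"]),
    "student2": (["join_course"], ["answer_quiz"]),
}
log = run_agents(scripts)
```

However the threads are scheduled, no `answer_quiz` or `collect_answers` entry can appear in `log` before every pre-gate action has been recorded.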

Having multiple scripts per test, however, creates a problem: it makes the test difficult to maintain. The primary problem is readability. Creating a mental picture of what is going on in the test lab by looking at multiple sources is quite cumbersome. It can be done with multiple scripts, but it’s an approach that becomes tedious and error prone. A better solution is to write one script in a way that encodes not only the actions performed on each machine, but also how those actions are synchronized.

THE SCORE

Let’s consider some desirable properties of a script that refers to multiple test-lab components. I refer to such a script as a score. (See Score Language Constructs sidebar.) The score needs to describe not only a sequence of steps, but also what happens in parallel on different test-lab components. A test-lab component can be a machine, an access point, a switch, etc. Anything that can be automatically manipulated that is also relevant to a test should be scriptable in the score. The score is executed on a server, the conductor, which controls the test lab.

To be maintainable, the score must first be readable and must refer to logical components rather than physical ones. This allows one score to be applied to a wider variety of configurations. A score might contain a preamble that makes the component mapping explicit, or the mapping might be done algorithmically. Either way, the body of the score should contain logical references only.

A key point is that a logical reference might actually refer to a group of machines. We use this approach to simulate a number of students performing a task on different computers in parallel. Blocks of instructions can be assigned to a non-empty set of components that is given a name (e.g., groupA in the Score Syntax sidebar). This property makes a score versatile and fault tolerant. A score is written assuming that on any given run of the test, there may be different physical members of groupA. For Silicon Chalk, there are several student machines, but only one instructor. By writing the script with Instructor and Student groups, we have the versatility to have each machine in our test lab behave as the instructor, in turn.
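One way to picture the algorithmic mapping from logical groups to physical machines is the Python sketch below, which gives each machine a turn as instructor. The machine names and the `assignments` helper are hypothetical; the article does not describe how Silicon Chalk computes its mapping.

```python
def assignments(machines):
    """Yield one logical-to-physical mapping per test run.

    Each machine takes a turn as the single instructor; all the
    others form the students group for that run.
    """
    for i, instructor in enumerate(machines):
        students = machines[:i] + machines[i + 1:]
        yield {"instructor": [instructor], "students": students}

lab = ["laptop-a", "laptop-b", "tablet-c"]  # hypothetical machine names
runs = list(assignments(lab))

for mapping in runs:
    print(mapping["instructor"], mapping["students"])
```

The body of the score never changes between runs; only the mapping handed to the conductor does.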

FAILURES

Failures may occur on any of the test-lab components during a test. These failures need to be identified so that the cause can be determined. Manually monitoring all of the components of the test lab is infeasible, so errors need to be detected automatically and logged for later analysis. A failure on one component does not mean that the test has failed. Indeed, some tests may be designed to cause failure behavior on various components in order to test robustness.

It is important to distinguish between a recoverable failure on a component, the failure of a component, and the failure of a test. As long as a minimal set of functioning components exists in the test lab, there is value in continuing with the test. Since we wish to uncover as many failures as possible with our tests, it is desirable to continue until the minimal set no longer exists.

Some consideration must be given to identifying when a test has failed, so that the test lab can be reset and the next test deployed. Since the script is written with the assumption that there is at least one member of each group, it is reasonable to assume that continuing the test has value until an action is encountered such that one of the groups specified no longer has members.

For example, if the instructor fails, a Silicon Chalk test is essentially dead and there is no point in continuing. If one student fails, however, we would like to continue the test in order to trap any other failures the test may uncover on other machines. Post-test log-file analysis will alert us to the earlier student failure. Even if multiple student failures occur, the test can continue, so long as there are at least one student and one instructor. Writing the score in this way lets us specify the required minimal set of test-lab components implicitly, and this set can change as the test progresses.
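The implicit minimal set can be sketched as a membership check made before each action: the test continues only while every group named by the next action still has at least one member. The `Lab` class and the component names are illustrative assumptions, not part of the Silicon Chalk test lab.

```python
class Lab:
    """Tracks which physical components still belong to each logical group."""

    def __init__(self, groups):
        self.groups = {name: set(members) for name, members in groups.items()}

    def drop(self, component):
        """Remove a failed component from every group it belongs to."""
        for members in self.groups.values():
            members.discard(component)

    def can_run(self, required_groups):
        """True if each group named by the next action has a live member."""
        return all(self.groups.get(group) for group in required_groups)

lab = Lab({"instructor": ["i1"], "students": ["s1", "s2"]})

lab.drop("s1")
one_student_left = lab.can_run(["instructor", "students"])  # True: test continues

lab.drop("s2")
students_gone = lab.can_run(["students"])          # False: student actions cannot run
instructor_only = lab.can_run(["instructor"])      # True: instructor-only actions still can
```

Because the check uses only the groups named by the upcoming action, the required minimal set changes as the score progresses.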

What does component failure mean? A failed component can no longer be trusted to perform as expected. Once this has happened, it is important to drop the component from the test because of its potential to introduce red herrings into the test results.

The conductor must identify a failure somehow. This can be through some component-internal means, such as an exception, or it might be through some component-external means, such as a liveness detector. Some failures may mean that the current operation has not succeeded, but the component is still able to function. Others are more serious and represent component failures. Both of these types can be accommodated by exceptions.

For example, consider the score fragment:

    try {
        [groupA] try {
            A
            B
        } catch (local_exception) { C }
    } catch (component_exception) { D }

The score defines code that might run on different components. In this example, the inner try-block is executed on groupA components. Note that D is outside the scope of the groupA designation, so D would execute on the conductor if an exception were thrown by C. A local failure occurring during A or B on a groupA component would cause the component to continue at C, but the conductor would not execute D because the component itself did not fail. A more serious exception would be trapped by the outermost try-block, the conductor would execute D, and the test would continue. If there were no outermost try-block, the score would fail, and the next test would be deployed.
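The two levels of handling can be imitated in Python. This is a sketch under stated assumptions: components are simulated sequentially in one process, `LocalError` and `ComponentError` are hypothetical stand-ins for `local_exception` and `component_exception`, and D is invoked once per failed component, a detail the score fragment itself leaves open.

```python
class LocalError(Exception):
    """The current operation failed, but the component can continue."""

class ComponentError(Exception):
    """The component itself can no longer be trusted."""

def run_on_component(actions, recover):
    """Inner try-block: runs on the component's agent."""
    try:
        for action in actions:
            action()
    except LocalError:
        recover()  # C: the component recovers and stays in the test

def run_fragment(group, actions, recover, on_component_failure):
    """Outer try-block: runs on the conductor. Returns surviving members."""
    survivors = []
    for component in group:
        try:
            run_on_component(actions(component), lambda: recover(component))
            survivors.append(component)
        except ComponentError:
            on_component_failure(component)  # D: handled on the conductor
    return survivors

events = []

def actions(component):
    def a():
        events.append((component, "A"))
        if component == "a2":
            raise LocalError("operation failed")
        if component == "a3":
            raise ComponentError("agent unreachable")
    def b():
        events.append((component, "B"))
    return [a, b]

survivors = run_fragment(
    ["a1", "a2", "a3"],
    actions,
    recover=lambda c: events.append((c, "C")),
    on_component_failure=lambda c: events.append(("conductor", "D for " + c)),
)
print(survivors)  # ['a1', 'a2']
```

Here a2 hits a local failure, continues at C, and stays in the group, while a3 raises the more serious exception, the conductor runs D, and a3 is dropped.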

NONREPEATABILITY

Ideally, we would like all the components in the test lab to execute their actions at precisely the right time to ensure that the test is truly repeatable. When debugging, we want to be able to observe the problematic behavior repeatedly to inspect different aspects of the fault. Unfortunately, as we increase the number of components in the test lab, we lose repeatability when we attempt to reproduce a bug. Even if we were able to re-create the correct stimuli across the test-lab components, re-creating the precise synchronized state is incredibly difficult.

There is a small delay between the conductor signaling the start of an action and the agent starting the action on the component. This means that, although we want as short a component start-delay as possible, some nonzero delay is expected. Factors that contribute to this delay include: different operating systems on the components, different additional software, different hard-disk fragmentation, different processor speeds, and different memory.

Since controlling these variables is impractical, I have found that it is actually desirable to assume these variations for each test run. This gives us confidence that, over time, we are covering a more realistic number of conditions with our tests. Since it is unlikely that we would be able to reproduce a given test run exactly, this means we need to rely on instrumenting the application so that we have as much information as possible at our disposal when finding the source of a failure. Logs should contain enough information so that a software developer can understand the context in which a failure has occurred. Some examples of information include code path tags and performance values.

Score Language Constructs

The example here illustrates the language constructs discussed in this article.

The function test1 (below) specifies the sequential and parallel actions that are to be performed during one test. There would be several such functions, each one representing a separate test.

Above the test specifications are instructions on test-lab setup prior to testing (install-build), a section that executes the tests (test1, test2), and finally a section that collects results and creates summaries.

    try <response-average-deadline-factor=5 timelimit=20*60*1000> {
        [laptops] {
            reimage("imagefile-\1");
            install-build();
        }
    }

    tests = new Array(test1, test2); // test2 defined elsewhere
    try <response-average-deadline-factor=1.5> {
        for test in tests {
            test();
        }
    }

    try <response-average-deadline-factor=3 timelimit=4*60*1000> {
        [laptops] reboot();
        sleep(3 * 60 * 1000);
        [laptops] report-log-summary("\1");
        create-summary-html();
    }

    function test1() {
        [instructor, students] Login();
        par {
            [instructor] {
                StartCourse(course);
                OpenPresentationHost();
                SetPresentationModeFollowTop();
                StartPresentation();
                Browser.open("http://www.cbc.ca");
                Sleep(10 * 1000);
                Browser.EnterField("search", "quirks");
                Sleep(5 * 1000);
                Browser.Click("Go");
            }
            [students] {
                JoinCourse(course);
                OpenPresentationClient();
            }
        }
        Sleep(5 * 1000);
        [instructor] PowerPointShow("test.ppt"); // contains 25 slides, slide transitions every 12 seconds
        [students] SendQuestion("This is a test question " + computername);
        [instructor, students] {
            LeaveCourse();
            Quit();
        }
    }

DEADLINES

The score syntax says nothing about when execution starts on a given component, nor how long execution on a component will take. All that is guaranteed are the synchronization points. This raises the deadline issue. How long do we wait for a component to arrive at a synchronization point? Somehow, we need to determine and impose a deadline we can use to deem components failed.

If a component misses the deadline at a synchronization point, it is dropped from the test. Indeed, since the component is now out of sync with and running outside the assumptions of the score, the corresponding agent must deactivate the component so that it cannot contaminate further test results.

At Silicon Chalk, our scripting language is based on user interface events that occur at prescribed intervals. This makes it simple to compute a deadline for a response from a component. The deadline is simply the time at which the next event would occur if there were one.

Since we assume that the application can process user interface events within the time interval specified in the script, any additional waiting that needs to happen is specified explicitly. For other scripting environments, other approaches might be more palatable. You may wish to specify the duration explicitly. The inevitable upgrade of test-lab hardware, however, will likely make this a labor-intensive approach.

Score Syntax

To specify a sequence of actions, we can use the normal approach found in most programming languages. Sequence:

    A
    B
    C

This, of course, means that A is followed by B and then C. To specify that actions happen in parallel, we might use a construct such as:

    par {
        A
        B
        C
    }

This means actions A, B, and C all start at the same time.

To specify which component performs the actions, we can simply tag the actions with a list of the logical components that will execute them.

    [groupA] A
    [groupB] B
    [groupA, groupB] C
    par {
        [groupA] A
        [groupB] B
    }

There are implicit synchronization points between each of the sequential actions (A-B, B-C, C-par). The scripting or programming language used in placeholders A, B, etc. is irrelevant. Indeed, it may be desirable to allow multiple languages, depending on the target components, which may have different agent script engines.
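The sequence, par, and tagging constructs from the Score Syntax sidebar can be modeled with a few Python combinators. This is a toy model, not a conductor: components are threads in one process, an action just records its name, "start at the same time" is only approximated by thread scheduling, and `act`, `par`, and `seq` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def act(groups, name):
    """A tagged action such as [groupA] A: every member of the groups runs it."""
    def run(lab, log):
        members = sorted({m for g in groups for m in lab[g]})
        with ThreadPoolExecutor(max_workers=len(members)) as pool:
            # Leaving the with-block waits for all members to finish,
            # which plays the role of the implicit synchronization point.
            list(pool.map(lambda m: log.append((m, name)), members))
    return run

def par(*steps):
    """par { ... }: all steps are started together; ends when all finish."""
    def run(lab, log):
        with ThreadPoolExecutor(max_workers=len(steps)) as pool:
            list(pool.map(lambda step: step(lab, log), steps))
    return run

def seq(*steps):
    """A B C: each step starts only after the previous one has completed."""
    def run(lab, log):
        for step in steps:
            step(lab, log)
    return run

lab = {"groupA": ["a1", "a2"], "groupB": ["b1"]}  # hypothetical components
score = seq(
    act(["groupA"], "A"),
    act(["groupB"], "B"),
    act(["groupA", "groupB"], "C"),
    par(act(["groupA"], "A"), act(["groupB"], "B")),
)
log = []  # list.append is atomic in CPython, so no explicit lock is used here
score(lab, log)
```

The order of entries within one step may vary between runs, but entries from a later sequential step never precede entries from an earlier one.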

A statistical approach might be more effective. What is needed is a simple means of specifying the degree of tolerance in the distribution of completion times for an action. For example, suppose that the conductor is configured to tolerate completions up to 10 percent longer than the average action duration, based on completions received. If the conductor knows that it received action starts from 10 component agents in a group and that it has recently received eight completions, then it can continually recalculate its own deadline and deem certain components failed after 1.1 times average duration has passed. Note that the conductor must wait arbitrarily long for the first component to respond.
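The 10 percent tolerance rule can be sketched as a function the conductor might call periodically. The function name, its parameters, and the use of seconds are assumptions made for illustration; the 10-second completion times are invented data.

```python
def overdue(started, completed, now, tolerance=0.10):
    """Return the agents deemed failed at time `now`.

    started:   set of agent names that reported an action start
    completed: dict mapping agent name -> completion time in seconds
    now:       seconds elapsed since the action started
    """
    if not completed:
        return set()  # no average yet: keep waiting for the first response
    average = sum(completed.values()) / len(completed)
    deadline = (1 + tolerance) * average
    if now <= deadline:
        return set()
    return started - set(completed)

started = {f"agent{i}" for i in range(10)}
completed = {f"agent{i}": 10.0 for i in range(8)}  # eight completions so far

still_waiting = overdue(started, completed, now=10.5)  # before the deadline
failed = overdue(started, completed, now=11.5)         # after 1.1 x average
```

With an average of 10 seconds the deadline sits at 11 seconds, so at 11.5 seconds the two agents that have not reported are deemed failed.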

To avoid waiting indefinitely for a single failed component, a suitably large default deadline might be used. Alternatively, a certain amount of time might be allocated for the test. If the time limit is exceeded, the test is terminated. Our Silicon Chalk tests use time limits plus the user interface event schedule approach, though I believe that the average response factor approach is more desirable. After all, one could envision an enhanced syntax that would allow the optional specification of the average wait factor and hard deadline.

As an example, consider [groupA] 1.5 10 S. This would mean that members of groupA have a maximum of 10 seconds to send an action-completion response to the conductor, and must also beat the deadline that is 1.5 times the current average of received responses. Let’s suppose there are four members of groupA: A1, A2, A3, and A4. If A1 responds at the 4-second mark, then A2, A3, and A4 will be deemed failed unless at least one of them responds before 6 seconds have passed. Suppose that A2 responds at 5 seconds. Now the deadline for A3 and A4 has moved to 6.75 seconds. If A3 responds at 6 seconds, A4 has until 7.5 seconds.

Deadline tolerance can easily be made more rigid. Consider the previous example with a deadline average factor of 1.05. If A1 responds at 4 seconds, at least one more response must be received before 4.2 seconds.
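The arithmetic in the groupA example can be checked with a short Python function that combines the average factor with the hard deadline. The function name and signature are illustrative; taking the smaller of the two limits is an interpretation of "must also beat".

```python
def deadline(responses, factor, hard_limit):
    """Current deadline in seconds, given the response times received so far."""
    if not responses:
        return hard_limit  # only the hard limit applies before the first response
    average = sum(responses) / len(responses)
    return min(factor * average, hard_limit)

# [groupA] 1.5 10 S, with responses arriving at 4, 5, and 6 seconds
print(deadline([4], 1.5, 10))        # 6.0
print(deadline([4, 5], 1.5, 10))     # 6.75
print(deadline([4, 5, 6], 1.5, 10))  # 7.5

# The more rigid factor of 1.05 after a first response at 4 seconds
rigid = deadline([4], 1.05, 10)      # about 4.2
```

Each new response moves the deadline because it changes the running average; the hard limit of 10 seconds would take over only if the factor pushed the computed deadline past it.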

LIMITS AND CHALLENGES

A well-designed test lab can economically address most of the testing requirements for the software it was designed to test, but there are always limits. I have chosen not to address a number of scenarios with the Silicon Chalk Test Lab for practical and economic reasons. Some examples include: very large numbers of laptops conducting several activities over a large number of access points; and exhaustive interactions with other software and wireless network hardware configurations.

For congregate applications such as Silicon Chalk, traditional test scripting is challenging because a description of the test is not centralized. A special script, what I call a score, which encodes script and synchronization information, would address this issue. A score also adds to the versatility and fault tolerance of the scripted tests.

Questions that arise when implementing a score-based test lab include:
• How much of the user environment needs to be represented in the test lab?
• How do you specify parallel actions in a practical and maintainable way?
• What types of failures need to be dealt with by the test lab infrastructure?
• How are those failures handled? How are deadlines determined?
• How are test results collected, summarized, and analyzed?
• How can the maintenance of the test lab be automated?

In this article, I have suggested some answers to these questions based on my Silicon Chalk perspective. I would expect that different products would favor different sets of answers.


MICHAEL DONAT (www3.telus.net/donat) is the director of quality assurance at Silicon Chalk Inc., Vancouver, BC, Canada. His interest in software development issues began while working as a software design engineer for Microsoft from 1987 to 1992. He earned his Ph.D. from the University of British Columbia, where his thesis focused on the automated generation of tests from a formalized set of requirements. © 2005 ACM 1542-7730/05/0200