Application 2 – Annotated Bibliography
54 COMMUNICATIONS OF THE ACM | JANUARY 2018 | VOL. 61 | NO. 1
practice
IMAGE BY VITEZSLAV VALKA
Abstracting the Geniuses Away from Failure Testing
Ordinary users need tools that automate the selection of custom-tailored faults to inject.
BY PETER ALVARO AND SEVERINE TYMON
DOI:10.1145/3152483
Article development led by queue.acm.org
The heterogeneity, complexity, and scale of cloud applications make verification of their fault tolerance properties challenging. Companies are moving away from formal methods and toward large-scale testing in which components are deliberately compromised to identify weaknesses in the software. For example, techniques such as Jepsen apply fault-injection testing to distributed data stores, and Chaos Engineering performs fault injection experiments on production systems, often on live traffic. Both approaches have captured the attention of industry and academia alike.
Unfortunately, the search space of distinct fault combinations that an infrastructure can test is intractable. Existing failure-testing solutions require skilled and intelligent users who can supply the faults to inject. These superusers, known as Chaos Engineers and Jepsen experts, must study the systems under test, observe system executions, and then formulate hypotheses about which faults are most likely to expose real system-design flaws. This approach is fundamentally unscalable and unprincipled. It relies on the superuser’s ability to interpret how a distributed system employs redundancy to mask or ameliorate faults and, moreover, the ability to recognize the insufficiencies in those redundancies—in other words, human genius.
This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments—observing executions, constructing models of system redundancy, and identifying weaknesses in the models—can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.
The Future Is Disorder
Providing an “always-on” experience for users and customers means that distributed software must be fault tolerant—that is to say, it must be written to anticipate, detect, and either mask or gracefully handle the effects of fault events such as hardware failures and network partitions. Writing fault-tolerant software—whether for distributed data management systems involving the interaction of a handful of physical machines, or for Web applications involving the cooperation of tens of thousands—remains extremely difficult. While the state of the art in verification and program analysis continues to evolve in the academic world, the industry is moving very much in the opposite direction: away from formal methods (with some noteworthy exceptions41) and toward approaches that combine testing with fault injection.
Here, we describe the underlying causes of this trend, why it has been successful so far, and why it is doomed to fail in its current practice.
The Old Gods. The ancient myth: Leave it to the experts. Once upon a time, distributed systems researchers and practitioners were confident that the responsibility for addressing the problem of fault tolerance could be relegated to a small priesthood of experts. Protocols for failure detection, recovery, reliable communication, consensus, and replication could be implemented once and hidden away in libraries, ready for use by the layfolk.
This has been a reasonable dream. After all, abstraction is the best tool for overcoming complexity in computer science, and composing reliable systems from unreliable components is fundamental to classical system design.33 Reliability techniques such as process pairs18 and RAID45 demonstrate that partial failure can, in certain cases, be handled at the lowest levels of a system and successfully masked from applications.
Unfortunately, these approaches rely on failure detection. Perfect failure detectors are impossible to implement in a distributed system,9,15 in which it is impossible to distinguish between delay and failure. Attempts to mask the fundamental uncertainty arising from partial failure in a distributed system—for example, RPC (remote procedure calls8) and NFS (network file system49)—have met (famously) with difficulties. Despite the broad consensus that these attempts are failed abstractions,28 in the absence of better abstractions, people continue to rely on them to the consternation of developers, operators, and users.
In a distributed system—that is, a system of loosely coupled components interacting via messages—the failure of a component is only ever manifested as the absence of a message. The only way to detect the absence of a message is via a timeout, an ambiguous signal that means either the message will never come or that it merely has not come yet. Timeouts are an end-to-end concern28,48 that must ultimately be managed by the application. Hence, partial failures in distributed systems bubble up the stack and frustrate any attempts at abstraction.
The Old Guard. The modern myth: Formally verified distributed components. If we cannot rely on geniuses to hide the specter of partial failure, the next best hope is to face it head on, armed with tools. Until quite recently, many of us (academics in particular) looked to formal methods such as model checking16,20,29,39,40,53,54 to assist “mere mortal” programmers in writing distributed code that upholds its guarantees despite pervasive uncertainty in distributed executions. It is not reasonable to exhaustively search the state space of large-scale systems (one cannot, for example, model check Netflix), but the hope is that modularity and composition (the next best tools for conquering complexity) can be brought to bear. If individual distributed components could be formally verified and combined into systems in a way that preserved their guarantees, then global fault tolerance could be obtained via composition of local fault tolerance.
Unfortunately, this, too, is a pipe dream. Most model checkers require a formal specification; most real-world systems have none (or have not had one since the design phase, many versions ago). Software model checkers and other program-analysis tools require the source code of the system under study. The accessibility of source code is also an increasingly tenuous assumption. Many of the data stores targeted by tools such as Jepsen are closed source; large-scale architectures, while typically built from open source components, are increasingly polyglot (written in a wide variety of languages).
Finally, even if you assume that specifications or source code are available, techniques such as model checking are not a viable strategy for ensuring that applications are fault tolerant because, as mentioned in the context of timeouts, fault tolerance itself is an end-to-end property that does not necessarily hold under composition. Even if you are lucky enough to build a system out of individually verified components, it does not follow that the system is fault tolerant—you may have made a critical error in the glue that binds them.
The Vanguard. The emerging ethos: YOLO. Modern distributed systems are simply too large, too heterogeneous, and too dynamic for these classic approaches to software quality to take root. In reaction, practitioners increasingly rely on resiliency techniques based on testing and fault injection.6,14,19,23,27,35 These “black box” approaches (which perturb and observe the complete system, rather than its components) are (arguably) better suited for testing an end-to-end property such as fault tolerance. Instead of deriving guarantees from understanding how a system works on the inside, testers of the system observe its behavior from the outside, building confidence that it functions correctly under stress.
Two giants have recently emerged in this space: Chaos Engineering6 and Jepsen testing.24 Chaos Engineering, the practice of actively perturbing production systems to increase overall site resiliency, was pioneered by Netflix,6 but since then LinkedIn,52 Microsoft,38 Uber,47 and PagerDuty5 have developed Chaos-based infrastructures. Jepsen performs black box testing and fault injection on unmodified distributed data management systems, in search of correctness violations (for example, counterexamples that show an execution was not linearizable).
Both approaches are pragmatic and empirical. Each builds an understanding of how a system operates under faults by running the system and observing its behavior. Both approaches offer a pay-as-you-go method to resiliency: the initial cost of integration is low, and the more experiments that are performed, the higher the confidence that the system under test is robust. Because these approaches represent a straightforward enrichment of existing best practices in testing with well-understood fault injection techniques, they are easy to adopt. Finally, and perhaps most importantly, both approaches have been shown to be effective at identifying bugs.
Unfortunately, both techniques also have a fatal flaw: they are manual processes that require an extremely sophisticated operator. Chaos Engineers are a highly specialized subclass of site reliability engineers. To devise a custom fault injection strategy, a Chaos Engineer typically meets with different service teams to build an understanding of the idiosyncrasies of various components and their interactions. The Chaos Engineer then targets those services and interactions that seem likely to have latent fault tolerance weaknesses. Not only is this approach difficult to scale since it must be repeated for every new composition of services, but its critical currency—a mental model of the system under study—is hidden away in a person’s brain. These points are reminiscent of a bigger (and more worrying) trend in industry toward reliability priesthoods,7 complete with icons (dashboards) and rituals (playbooks).
Jepsen is in principle a framework that anyone can use, but to the best of our knowledge all of the reported bugs discovered by Jepsen to date were discovered by its inventor, Kyle Kingsbury, who currently operates a “distributed systems safety research” consultancy.24 Applying Jepsen to a storage system requires that the superuser carefully read the system documentation, generate workloads, and observe the externally visible behaviors of the system under test. It is then up to the operator to choose—from the massive combinatorial space of “nemeses,” including machine crashes and network partitions—those fault schedules that are likely to drive the system into returning incorrect responses.
A human in the loop is the kiss of death for systems that need to keep up with software evolution. Human attention should always be targeted at tasks that computers cannot do! Moreover, the specialists that Chaos and Jepsen testing require are expensive and rare. Here, we show how geniuses can be abstracted away from the process of failure testing.
We Don’t Need Another Hero
Rapidly changing assumptions about our visibility into distributed system internals have made obsolete many if not all of the classic approaches to software quality, while emerging “chaos-based” approaches are fragile and unscalable because of their genius-in-the-loop requirement.
We present our vision of automated failure testing by looking at how the same changing environments that hastened the demise of time-tested resiliency techniques can enable new ones.
We argue the best way to automate the experts out of the failure-testing loop is to imitate their best practices in software and show how the emergence of sophisticated observability infrastructure makes this possible.
The order is rapidly fadin’. For large-scale distributed systems, the three fundamental assumptions of traditional approaches to software quality are quickly fading in the rearview mirror. The first to go was the belief that you could rely on experts to solve the hardest problems in the domain. Second was the assumption that a formal specification of the system is available. Finally, any program analysis (broadly defined) that requires that source code is available must be taken off the table. The erosion of these assumptions helps explain the move away from classic academic approaches to resiliency in favor of the black box approaches described earlier.
What hope is there of understanding the behavior of complex systems in this new reality? Luckily, the fact that it is more difficult than ever to understand distributed systems from the inside has led to the rapid evolution of tools that allow us to understand them from the outside. Call-graph logging was first described by Google;51 similar systems are in use at Twitter,4 Netflix,1 and Uber,50 and the technique has since been standardized.43 It is reasonable to assume that a modern microservice-based Internet enterprise will already have instrumented its systems to collect call-graph traces. A number of startups that focus on observability have recently emerged.21,34 Meanwhile, provenance collection techniques for data processing systems11,22,42 are becoming mature, as are operating system-level provenance tools.44 Recent work12,55 has attempted to infer causal and communication structure of distributed computations from raw logs, bringing high-level explanations of outcomes within reach even for uninstrumented systems.
Regarding testing distributed systems. Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests.
—Commentator on HackerRumor
Away from the experts. While this quote is anecdotal, it is difficult to imagine a better example of the fundamental unscalability of the current state of the art. A single person cannot possibly keep pace with the explosion of distributed system implementations. If we can take the human out of this critical loop, we must; if we cannot, we should probably throw in the towel.
The first step to understanding how to automate any process is to comprehend the human component that we would like to abstract away. How do Chaos Engineers and Jepsen superusers apply their unique genius in practice? Here is the three-step recipe common to both approaches.
Step 1: Observe the system in action. The human element of the Chaos and Jepsen processes begins with principled observation, broadly defined.
A Chaos Engineer will, after studying the external API of services relevant to a given class of interactions, meet with the engineering teams to better understand the details of the implementations of the individual services.25 To understand the high-level interactions among services, the engineer will then peruse call-graph traces in a trace repository.3
A Jepsen superuser typically begins by reviewing the product documentation, both to determine the guarantees that the system should uphold and to learn something about the mechanisms by which it does so. From there, the superuser builds a model of the behavior of the system based on interaction with the system’s external API. Since the systems under study are typically data management and storage, these interactions involve generating histories of reads and writes.31
The first step to understanding what can go wrong in a distributed system is watching things go right: observing the system in the common case.
Step 2: Build a mental model of how the system tolerates faults. The common next step in both approaches is the most subtle and subjective. Once there is a mental model of how a distributed system behaves (at least in the common case), how is it used to help choose the appropriate faults to inject? At this point we are forced to dabble in conjecture: bear with us.
Fault tolerance is redundancy. Given some fixed set of faults, we say that a system is “fault tolerant” exactly if it operates correctly in all executions in which those faults occur. What does it mean to “operate correctly”? Correctness is a system-specific notion, but, broadly speaking, is expressed in terms of properties that are either maintained throughout the system’s execution (for example, system invariants or safety properties) or established during execution (for example, liveness properties). Most distributed systems with which we interact, though their executions may be unbounded, nevertheless provide finite, bounded interactions that have outcomes. For example, a broadcast protocol may run “forever” in a reactive system, but each broadcast delivered to all group members constitutes a successful execution.
By viewing distributed systems in this way, we can revise the definition: A system is fault tolerant if it provides sufficient mechanisms to achieve its successful outcomes despite the given class of faults.
Step 3: Formulate experiments that target weaknesses in the façade. If we could understand all of the ways in which a system can obtain its good outcomes, we could understand which faults it can tolerate (or which faults it could be sensitive to). We assert that (whether they realize it or not!) the process by which Chaos Engineers and Jepsen superusers determine, on a system-by-system basis, which faults to inject uses precisely this kind of reasoning. A target experiment should exercise a combination of faults that knocks out all of the supports for an expected outcome.
Carrying out the experiments turns out to be the easy part. Fault injection infrastructure, much like observability infrastructure, has evolved rapidly in recent years. In contrast to random, coarse-grained approaches to distributed fault injection such as Chaos Monkey,23 approaches such as FIT (failure injection testing)17 and Gremlin32 allow faults to be injected at the granularity of individual requests with high precision.
Figure 1. Our vision of automated failure testing. (Diagram labels: explanations; models of redundancy; fault injection.)
Figure 2. Fault injection and fault-tolerant code. (Diagram labels: APP1, APP2, API, caller, fault, callee.)
Step 4: Profit! This process can be effectively automated. The emergence of sophisticated tracing tools described earlier makes it easier than ever to build redundancy models even from the executions of black box systems. The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. Figure 1 illustrates how the automation described here fits neatly between existing observability infrastructure and fault injection infrastructure, consuming the former, maintaining a model of system redundancy, and using it to parameterize the latter. Explanations of system outcomes and fault injection infrastructures are already available. In the current state of the art, the puzzle piece that fits them together (models of redundancy) is a manual process. LDFI (as we will explain) shows that automation of this component is possible.
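As a rough illustration of the loop this section describes, the following Python sketch treats each observed successful execution as a "support" (the set of components it relied on), picks a fault set that knocks out every known support, and injects it. All names here (`make_system`, `find_bug`, the component labels) are hypothetical, the system under test is a toy stand-in for real tracing and fault injection infrastructure, and the search is brute force rather than anything a production tool would use.

```python
from itertools import combinations


def smallest_hitting_set(supports, max_faults):
    """Smallest set of components whose failure breaks every known support."""
    components = sorted(set().union(*supports))
    for size in range(1, max_faults + 1):
        for faults in combinations(components, size):
            if all(set(faults) & support for support in supports):
                return frozenset(faults)
    return None  # no hypothesis left within the fault budget


def make_system(paths):
    """Toy system under test (an assumption): tries each path in order and
    reports the components of the first path untouched by the faults."""
    def run(faults):
        for path in paths:
            if not (set(path) & faults):
                return frozenset(path)
        return None  # the expected outcome was not achieved
    return run


def find_bug(run, max_faults):
    """Observe, model redundancy, inject, and refine until a bug is found
    or the model admits no further experiment."""
    first = run(frozenset())
    if first is None:
        raise ValueError("the fault-free execution must succeed")
    supports = [first]          # Step 1: observe the common case
    experiments = 0
    while True:
        faults = smallest_hitting_set(supports, max_faults)  # Steps 2 and 3
        if faults is None:
            return None, experiments
        experiments += 1
        observed = run(faults)  # fault injection
        if observed is None:
            return faults, experiments   # outcome lost: a bug
        supports.append(observed)        # new redundancy revealed


robust = make_system([{"primary"}, {"replica_1"}, {"replica_2"}])
fragile = make_system([{"primary"}, {"replica_1"}])
print(find_bug(robust, max_faults=2))
print(find_bug(fragile, max_faults=2))
```

Each successful experiment reveals a support that the previous hypothesis did not cover, so the model grows monotonically and the loop terminates once no fault set within the budget can break every known support.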
A Blast from the Past
In previous work, we introduced a bug-finding tool called LDFI (lineage-driven fault injection).2 LDFI uses data provenance collected during simulations of distributed executions to build derivation graphs for system outcomes. These graphs function much like the models of system redundancy described earlier. LDFI then converts the derivation graphs into a Boolean formula whose satisfying assignments correspond to combinations of faults that invalidate all derivations of the outcome. An experiment targeting those faults will then either expose a bug (that is, the expected outcome fails to occur) or reveal additional derivations (for example, after a timeout, the system fails over to a backup) that can be used to enrich the model and constrain future solutions.
At its heart, LDFI reapplies well-understood techniques from data management systems, treating fault tolerance as a materialized view maintenance problem.2,13 It models a distributed system as a query, its expected outcomes as query outcomes, and critical facts such as “replica A is up at time t” and “there is connectivity between nodes X and Y during the interval i . . . j” as base facts. It can then ask a how-to query:37 What changes to base data will cause changes to the derived data in the view? The answers to this query are the faults that could, according to the current model, invalidate the expected outcomes.
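To make the encoding concrete, here is a minimal sketch under stated assumptions: each derivation of an outcome is a set of base facts it depends on, and a fault falsifies one base fact. The formula then has one clause per derivation, satisfied when at least one of that derivation's facts is falsified. The fact names and helper functions are invented for illustration, and exhaustive enumeration stands in for whatever solver a real implementation would call.

```python
from itertools import product


def fault_formula(derivations):
    """One clause per derivation: at least one of its base facts is falsified.
    The conjunction of all clauses invalidates every derivation."""
    return [frozenset(d) for d in derivations]


def satisfying_fault_sets(cnf):
    """Enumerate every assignment of 'falsified' flags that satisfies the CNF."""
    facts = sorted(set().union(*cnf))
    solutions = []
    for bits in product([False, True], repeat=len(facts)):
        faults = frozenset(f for f, flag in zip(facts, bits) if flag)
        if all(clause & faults for clause in cnf):
            solutions.append(faults)
    return solutions


def minimal(solutions):
    """Keep only fault sets that have no satisfying proper subset."""
    return [s for s in solutions if not any(t < s for t in solutions)]


# Hypothetical model: a write is acknowledged via replica A or via replica B.
derivations = [
    {"up(A)", "link(client,A)"},
    {"up(B)", "link(client,B)"},
]
cnf = fault_formula(derivations)
hypotheses = minimal(satisfying_fault_sets(cnf))
for faults in sorted(hypotheses, key=sorted):
    print(sorted(faults))
```

Each printed set is a candidate experiment. If injecting it reveals a further derivation (say, a failover path), that derivation becomes one more clause, which rules out every hypothesis that does not also break the new path.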
The idea seems far-fetched, but the LDFI approach shows a great deal of promise. The initial prototype demonstrated the efficacy of the approach at the level of protocols, identifying bugs in replication, broadcast, and commit protocols.2,46 Notably, LDFI reproduced a bug in the replication protocol used by the Kafka distributed log26 that was first (manually) identified by Kingsbury.30 A later iteration of LDFI is deployed at Netflix,1 where (much like the illustration in Figure 1) it was implemented as a microservice that consumes traces from a call-graph repository service and provides inputs for a fault injection service. Since its deployment, LDFI has identified 11 critical bugs in user-facing applications at Netflix.1
Rumors from the Future
The research presented earlier is only the tip of the iceberg. Much work still needs to be undertaken to realize the vision of fully automated failure testing for distributed systems. Here, we highlight nascent research that shows promise and identify new directions that will help realize our vision.
Don’t overthink fault injection. In the context of resiliency testing for distributed systems, attempting to enumerate and faithfully simulate every possible kind of fault is a tempting but distracting path. The problem of understanding all the causes of faults is not directly relevant to the target, which is to ensure that code (along with its configuration) intended to detect and mitigate faults performs as expected.
Consider Figure 2: The diagram on the left shows a microservice-based architecture; arrows represent calls generated by a client request. The right-hand side zooms in on a pair of interacting services. The shaded box in the caller service represents the fault tolerance logic that is intended to detect and handle faults of the callee. Failure testing targets bugs in this logic, while the injected faults affect the callee.
The common effects of all faults, from the perspective of the caller, are explicit error returns, corrupted responses, and (possibly infinite) delay. Of these manifestations, the first two can be adequately tested with unit tests. The last is difficult to test, leading to branches of code that are infrequently executed. If we inject only delay, and only at component boundaries, we conjecture that we can address the majority of bugs related to fault tolerance.
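The conjecture can be illustrated with a small sketch: wrap the callee at the component boundary so that it responds late, then check that the caller's timeout-and-fallback logic actually runs. This is a toy example built on Python's standard `concurrent.futures` and `time` modules; the service, the fallback value, and the helper names are assumptions made for illustration, not part of FIT, Gremlin, or LDFI.

```python
import concurrent.futures
import time


def with_injected_delay(callee, delay_s):
    """Fault injection at the boundary: same callee, but it answers late."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return callee(*args, **kwargs)
    return delayed


def call_with_fallback(callee, request, timeout_s, fallback):
    """Caller-side fault tolerance logic: wait up to timeout_s for the callee,
    otherwise give up and return the fallback."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(callee, request)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block on the late response; the worker finishes on its own.
        pool.shutdown(wait=False)


def recommendations(user):
    """Hypothetical callee service."""
    return ["personalized for " + user]


# Common case: the callee answers well within the timeout.
print(call_with_fallback(recommendations, "u1", 1.0, ["popular"]))

# Experiment: inject delay only, and confirm the rarely executed branch works.
slow = with_injected_delay(recommendations, 0.5)
print(call_with_fallback(slow, "u1", 0.05, ["popular"]))
```

The caller cannot tell whether the late callee has crashed or is merely slow, which is why a single delay fault is enough to exercise this branch.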
Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models
The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems.
60 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
practice
to embrace (rather than abstracting away) this uncertainty.
Distributed systems are probabi- listic by nature and are arguably bet- ter modeled probabilistically. Future directions of work include the proba- bilistic representation of system re- dundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy.
Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human- computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an un- expected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to con- struct models of redundancy in order to drive the search through faults.
Ideally, explanations should play a role in both worlds. After all, when a bug-finding tool such as LDFI identi- fies a counterexample to a correctness property, the job of the programmers has only just begun—now they must un- dertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were de- signed for a single site, a uniform mem- ory, and a single clock. While we are not certain what an ideal distributed debug- ger should look like, we are quite certain that it does not look like GDB (GNU Proj- ect debugger).36 The derivation graphs used by LDFI show how provenance can also serve a role in debugging by provid- ing a concise, visual explanation of how the system reached a bad state.
This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assump- tion) explanations of the bad outcomes from the explanations of the good out-
of redundancy. Unfortunately, a bar- rier to entry for systems such as LDFI is the unwillingness of software de- velopers and operators to instrument their systems for tracing or provenance collection. Fortunately, operating sys- tem-level provenance-collection tech- niques are mature and can be applied to uninstrumented systems.
Moreover, the container revolution makes simulating distributed execu- tions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, iden- tifying appropriate cut points in an observed execution, and finally syn- chronizing the execution with fault injection actions.
We are also interested in the pos- sibility of inferring high-level explana- tions from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems un- der study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine12 developed at Face- book shows great promise.
Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a materialized-view maintenance prob- lem. LDFI was hence able to exploit well-understood theory and mecha- nisms from the history of data man- agement systems research. But this is just one of many ways to represent how a system provides alternative computa- tions to achieve its expected outcomes.
A shortcoming of the LDFI approach is its reliance on assumptions of de- terminism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distrib- uted systems. A more appropriate way to model system redundancy would be
The container revolution makes simulating distributed executions of black-box software within a single hypervisor easier than ever.
J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 61
practice
36. Matloff, N., Salzman, P.J. The Art of Debugging with GDB, DDD, and Eclipse. No Starch Press, 2008.
37. Meliou, A., Suciu, D. Tiresias: The database oracle for how-to queries. Proceedings of the ACM SIGMOD International Conference on the Management of Data (2012), 337-348.
38. Microsoft Azure Documentation. Introduction to the fault analysis service, 2016; https://azure.microsoft. com/en-us/documentation/articles/ service-fabric- testability-overview/.
39. Musuvathi, M. et al. CMC: A pragmatic approach to model checking real code. ACM SIGOPS Operating Systems Review. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation 36 (2002), 75–88.
40. Musuvathi, M. et al. Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th Usenix Conference on Operating Systems Design and Implementation (2008), 267–280.
41. Newcombe, C. et al. Use of formal methods at Amazon Web Services. Technical Report, 2014; http:// lamport.azurewebsites.net/tla/formal-methods- amazon.pdf.
42. Olston, C., Reed, B. Inspector Gadget: A framework for custom monitoring and debugging of distributed data flows. In Proceedings of the ACM SIGMOD International Conference on the Management of Data (2011), 1221–1224.
43. OpenTracing. 2016; http://opentracing.io/. 44. Pasquier, T.F. J.-M., Singh, J., Eyers, D.M., Bacon, J.
CamFlow: Managed data-sharing for cloud services, 2015; https://arxiv.org/pdf/1506.04391.pdf.
45. Patterson, D.A., Gibson, G., Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 109–116; http://web.mit.edu/6.033/2015/wwwdocs/papers/ Patterson88.pdf.
46. Ramasubramanian, K. et al. Growing a protocol. In Proceedings of the 9th Usenix Workshop on Hot Topics in Cloud Computing (2017).
47. Reinhold, E. Rewriting Uber engineering: The opportunities microservices provide. Uber Engineering, 2016; https: //eng.uber.com/building-tincup/.
48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end arguments in system design. ACM Trans. Computing Systems 2, 4 (1984): 277–288.
49. Sandberg, R. The Sun network file system: design, implementation and experience. Technical report, Sun Microsystems. In Proceedings of the Summer 1986 Usenix Technical Conference and Exhibition.
50. Shkuro, Y. Jaeger: Uber’s distributed tracing system. Uber Engineering, 2017; https://uber.github.io/jaeger/.
51. Sigelman, B.H. et al. Dapper, a large-scale distributed systems tracing infrastructure. Technical report. Research at Google, 2010; https://research.google. com/pubs/pub36356.html.
52. Shenoy, A. A deep dive into Simoorg: Our open source failure induction framework. Linkedin Engineering, 2016; https://engineering.linkedin.com/blog/2016/03/ deep-dive-Simoorg-open-source-failure-induction- framework.
53. Yang, J. et al. MODIST: Transparent model checking of unmodified distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation (2009), 213–228.
54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ specifications. In Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (1999), 54–66.
55. Zhao, X. et al. Lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 629–644.
Peter Alvaro is an assistant professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io).
Severine Tymon is a technical writer who has written documentation for both internal and external users of enterprise and open source software, including for Microsoft, CNET, VMware, and Oracle.
Copyright held by owners/authors. Publication rights licensed to ACM. $15.00.
comes,10 then the root cause of the discrepancy would be likely to be near the “frontier” of the difference.
Conclusion
A sea change is occurring in the techniques used to determine whether distributed systems are fault tolerant. The emergence of fault injection approaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert programmers, formal specifications, and uniform source code. For all of their promise, these new approaches are crippled by their reliance on superusers who decide which faults to inject.
To address this critical shortcoming, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observability and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides constructive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing better explanations to end users of exactly what went wrong when bugs are identified. The distributed systems research community is invited to join in exploring this new and promising domain.
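To make the notion of a "model of system redundancy" concrete, here is a minimal sketch. It is an illustrative simplification and not the LDFI implementation: it assumes that observing a successful execution yields a set of alternative support paths (each a set of components that together suffice to produce the good outcome), and it enumerates the smallest sets of component faults that intersect every path. The function name `minimal_fault_sets`, the `max_faults` bound, and the component names are all hypothetical.

```python
from itertools import combinations


def minimal_fault_sets(support_paths, max_faults=3):
    """Return the smallest fault sets that break every support path.

    support_paths: iterable of sets of component names. Each set is one
    redundant way the system reached the correct outcome (an assumption
    of this sketch, standing in for lineage extracted from traces).
    Returns a list of sorted tuples, all of the same minimal size, or an
    empty list if no fault set of size <= max_faults breaks every path.
    """
    paths = [frozenset(p) for p in support_paths]
    components = sorted(set().union(*paths))
    for size in range(1, max_faults + 1):
        hits = [
            candidate
            for candidate in combinations(components, size)
            # A candidate is interesting only if it disables every path.
            if all(path & set(candidate) for path in paths)
        ]
        if hits:
            return hits
    return []


# Hypothetical example: the outcome is reached either through replicaA
# alone, or through the broker together with replicaB.
paths = [{"replicaA"}, {"broker", "replicaB"}]
print(minimal_fault_sets(paths))
```

Each returned tuple is a candidate experiment: a combination of faults that, according to the model, should defeat all observed redundancy. If the system still produces the good outcome after those faults are injected, the new execution reveals an additional support path that can be added to the model before the search is repeated.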
Related articles on queue.acm.org
Fault Injection in Production
John Allspaw
http://queue.acm.org/detail.cfm?id=2353017

The Verification of a Distributed System
Caitie McCaffrey
http://queue.acm.org/detail.cfm?id=2889274

Injecting Errors for Fun and Profit
Steve Chessin
http://queue.acm.org/detail.cfm?id=1839574
References
1. Alvaro, P. et al. Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28.
2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2015), 331–346.
3. Andrus, K. Personal communication, 2016.
4. Aniszczyk, C. Distributed systems tracing with Zipkin. Twitter Engineering; https://blog.twitter.com/2012/distributed-systems-tracing-with-zipkin.
5. Barth, D. Inject failure to make your systems more reliable. DevOps.com; http://devops.com/2014/06/03/inject-failure/.
6. Basiri, A. et al. Chaos Engineering. IEEE Software 33, 3 (2016), 35–41.
7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O’Reilly, 2016.
8. Birrell, A.D., Nelson, B.J. Implementing remote procedure calls. ACM Trans. Computer Systems 2, 1 (1984), 39–59.
9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest failure detector for solving consensus. J. ACM 43, 4 (1996), 685–722.
10. Chen, A. et al. The good, the bad, and the differences: better network diagnostics with differential provenance. In Proceedings of the ACM SIGCOMM Conference (2016), 115–128.
11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. Explaining outputs in modern data analytics. In Proceedings of the VLDB Endowment 9, 12 (2016), 1137–1148.
12. Chow, M. et al. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 217–231.
13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Systems 25, 2 (2000), 179–227.
14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: A Fault Injection Environment for Distributed Systems. In Proceedings of the 26th International Symposium on Fault-tolerant Computing (1996).
15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (1985), 374–382; https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf.
16. Fisman, D., Kupferman, O., Lustig, Y. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4963, Springer Verlag (2008), 315–331.
17. Gopalani, N., Andrus, K., Schmaus, B. FIT: Failure injection testing. Netflix Technology Blog; http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html.
18. Gray, J. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (1985); http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf.
19. Gunawi, H.S. et al. FATE and DESTINI: A framework for cloud recovery testing. In Proceedings of the 8th Usenix Conference on Networked Systems Design and Implementation (2011), 238–252; http://db.cs.berkeley.edu/papers/nsdi11-fate-destini.pdf.
20. Holzmann, G. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
21. Honeycomb. 2016; https://honeycomb.io/.
22. Interlandi, M. et al. Titian: Data provenance support in Spark. In Proceedings of the VLDB Endowment 9, 33 (2015), 216–227.
23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. Netflix Technology Blog; http://techblog.netflix.com/2011/07/netflix-simian-army.html.
24. Jepsen. Distributed systems safety research, 2016; http://jepsen.io/.
25. Jones, N. Personal communication, 2016.
26. Kafka 0.8.0. Apache, 2013; https://kafka.apache.org/08/documentation.html.
27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Computers 44, 2 (1995), 248–260.
28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note on distributed computing. Technical Report, 1994. Sun Microsystems Laboratories.
29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, death, and the critical transition: Finding liveness bugs in systems code. Networked System Design and Implementation (2007), 243–256.
30. Kingsbury, K. Call me maybe: Kafka, 2013; http://aphyr.com/posts/293-call-me-maybe-kafka.
31. Kingsbury, K. Personal communication, 2016.
32. Lafeldt, M. The discipline of Chaos Engineering. Gremlin Inc., 2017; https://blog.gremlininc.com/the-discipline-of-chaos-engineering-e39d2383c459.
33. Lampson, B.W. Atomic transactions. In Distributed Systems—Architecture and Implementation, An Advanced Course (1980), 246–265; https://link.springer.com/chapter/10.1007%2F3-540-10571-9_11.
34. LightStep. 2016; http://lightstep.com/.
35. Marinescu, P.D., Candea, G. LFI: A practical and general library-level fault injector. In IEEE/IFIP International Conference on Dependable Systems and Networks (2009).