essay
Status Report on Software Measurement
SHARI LAWRENCE PFLEEGER, Systems/Software and Howard University
ROSS JEFFERY, University of New South Wales
BILL CURTIS, TeraQuest Metrics
BARBARA KITCHENHAM, Keele University
The most successful measurement programs are ones in which researcher, practitioner, and customer work hand in hand to meet goals and solve problems. But such collaboration is rare. The authors explore the gaps between these groups and point toward ways to bridge them.
In any scientific field, measurement generates quantitative descriptions of key processes and products, enabling us to understand behavior and result. This enhanced understanding lets us select better techniques and tools to control and improve our processes, products, and resources. Because engineering involves the analysis of measurements, software engineering cannot become a true engineering discipline unless we build a solid foundation of measurement-based theories.
One obstacle to building this base is the gap between measurement research and measurement practice. This status report describes the state of research, the state of the art, and the state of practice of software measurement. It reflects discussion at the Second International Software Metrics Symposium, which we organized. The aim of the symposium is to encourage researchers and practitioners to share their views, problems, and needs, and to work together to define future activities that will address common goals. Discussion at the symposium revealed that participants had different and sometimes conflicting motivations.
IEEE Software, March/April 1997. 0740-7459/97/$10.00 © 1997 IEEE
♦ Researchers, many of whom are in academic environments, are motivated by publication. In many cases, highly theoretical results are never tested empirically, new metrics are defined but never used, and new theories are promulgated but never exercised and modified to fit reality.
♦ Practitioners want short-term, useful results. Their projects are in trouble now, and they are not always willing to be a testbed for studies whose results won’t be helpful until the next project. In addition, practitioners are not always willing to make their data available to researchers, for fear that the secrets of technical advantage will be revealed to their competitors.
♦ Customers, who are not always involved as development progresses, feel powerless. They are forced to specify what they need and then can only hope they get what they want.
It is no coincidence that the most successful examples of software measurement are the ones where researcher, practitioner, and customer work hand in hand to meet goals and solve problems. But such coordination and collaboration are rare, and there are many problems to resolve before reaching that desirable and productive state. To understand how to get there, we begin with a look at the right and wrong uses of measurement.
MEASUREMENT: USES AND ABUSES
Software measurement has existed since the first compiler counted the number of lines in a program listing. As early as 1974, in an ACM Computing Surveys article, Donald Knuth reported on using measurement data to demonstrate how Fortran compilers can be optimized, based on actual language use rather than theory. Indeed, measurement has become a natural part of many software engineering activities.
♦ Developers, especially those involved in large projects with long schedules, use measurements to help them understand their progress toward completion.
♦ Managers look for measurable milestones to give them a sense of project health and progress toward effort and schedule commitments.
♦ Customers, who often have little control over software production, look to measurement to help determine the quality and functionality of products.
♦ Maintainers use measurement to inform their decisions about reusability, reengineering, and legacy code replacement.
Proper usage. IEEE Software and other publications have many articles on how measurement can help improve our products, processes, and resources. For example, Ed Weller described how metrics helped to improve the inspection process at Honeywell;1 Wayne Lim discussed how measurement supports Hewlett-Packard’s reuse program, helping project managers estimate module reuse and predict the savings in resources that result;2 and Michael Daskalantonakis reported on the use of measurement to improve processes at Motorola.3 In each case, measurement helped make visible what is going on in the code, the development processes, and the project team.
For many of us, measurement has become standard practice. We use structural-complexity metrics to target our testing efforts, defect counts to help us decide when to stop testing, or failure information and operational profiles to assess code reliability. But we must be sure that the measurement efforts are consonant with our project, process, and product goals; otherwise, we risk abusing the data and making bad decisions.
Real-world abuses. For a look at how dissonance in these goals can create problems, consider an example described by Michael Evangelist.4 Suppose you measure program size using lines of code or Halstead measures (measures based on the number of operators and operands in a program). In both cases, common wisdom suggests that module size be kept small, as short modules are easier to understand than large ones. Moreover, as size is usually the key factor in predicting effort, small modules should take less time to produce than large ones. However, this metrics-driven approach can lead to increased effort during testing or maintenance. For example, consider the following code segment:
    FOR i = 1 to n DO
        READ (x[i])
Clearly, this code is designed to read a list of n things. But Brian Kernighan and William Plauger, in their classic book The Elements of Programming Style, caution programmers to terminate input by an end-of-file or marker, rather than using a count. If a count ends the loop and the set being read has more or fewer than n elements, an error condition can result. A simple solution to this problem is to code the read loop like this:
    i := 1
    WHILE NOT EOF DO
        READ (x[i])
        i := i + 1
    END
This improved code is still easy to read but is not subject to the counting errors of the first code. On the other hand, if we judge the two pieces of code in terms of minimizing size, then the first code segment is better than the second. Had standards been set according to size metrics (as sometimes happens), the programmer could have been encouraged to keep the code smaller, and the resulting code would have been more difficult to test and maintain.
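The robustness difference between the two loops can be sketched in a modern language. The following is only an illustration in Python, not code from the sources cited above; the function names `read_by_count` and `read_until_eof` are hypothetical. Note that the size comparison made in the text does not carry over, because Python's idiom for reading to end of file happens to be the shorter one. That reinforces the point that a size metric says little about testability.

```python
import io


def read_by_count(stream, n):
    """Read exactly n lines, mirroring the counted FOR loop."""
    values = []
    for _ in range(n):
        line = stream.readline()
        if line == "":  # input ended before n items were read
            raise ValueError("input ended before the expected count")
        values.append(line.strip())
    return values  # any items beyond the first n are silently ignored


def read_until_eof(stream):
    """Read every line until end of file, mirroring the WHILE NOT EOF loop."""
    return [line.strip() for line in stream]


data = "a\nb\nc\n"
print(read_until_eof(io.StringIO(data)))    # reads all three items
print(read_by_count(io.StringIO(data), 2))  # stops after two; the third is lost
```

With a count that is too large, `read_by_count` fails outright; with one that is too small, it quietly drops data. Both are the counting errors the text warns about.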
Another abuse can occur when you use process measures. Scales such as the US Software Engineering Institute’s Capability Maturity Model can be used as an excuse not to implement an activity. For example, managers complain that they cannot institute a reuse program because they are only a level 1 on the maturity scale. But reuse is not prohibited at level 1; the CMM suggests that such practices are a greater risk if basic project disciplines (such as making sensible commitments and managing product baselines) have not been established. If productivity is a particular project goal, and if a rich code repository exists from previous projects, reuse may be appropriate and effective regardless of your organization’s level.
Roots of abuse. In each case, it is not the metric but the measurement process that is the source of the abuse: The metrics are used without keeping the development goals in mind. In the code-length case, the metrics should be chosen to support goals of testability and maintainability. In the CMM case, the goal is to improve productivity by introducing reuse. Rather than prevent movement, the model should suggest which steps to take first.
Thus, measurement, like any technology, must be used with care. Software measurement should never be applied in isolation. Rather, it should be an integral part of a general assessment or improvement program, where the measures support the goals and help to evaluate the results of the actions. To use measurement properly, we must understand the nature and goals of measurement itself.
MEASUREMENT THEORY
One way of distinguishing between real-world objects or entities is to describe their characteristics. Measurement is one such description. A measure is simply a mapping from the real, empirical world to a mathematical world, where we can more easily understand an entity’s attributes and relationship to other entities. The difficulty is in how we interpret the mathematical behavior and judge what it means in the real world.
None of these notions is particular to software development. Indeed, measurement theory has been studied for many years, beginning long before computers were around. But the issues of measurement theory are very important in choosing and applying metrics to software development.
Scales. Measurement theory holds, as a basic principle, that there are several scales of measurement (nominal, ordinal, interval, and ratio), and each captures more information than its predecessor. A nominal scale puts items into categories, such as when we identify a programming language as Ada, Cobol, Fortran, or C++. An ordinal scale ranks items in an order, such as when we assign failures a progressive severity like minor, major, and catastrophic.
An interval scale defines a distance from one point to another, so that there are equal intervals between consecutive numbers. This property permits computations not available with the ordinal scale, such as calculating the mean. However, there is no absolute zero point in an interval scale, and thus ratios do not make sense. Care is thus needed when you make comparisons. The Celsius and Fahrenheit temperature scales, for example, are interval, so we cannot say that today’s 30-degree Celsius temperature is twice as hot as yesterday’s 15 degrees.
The scale with the most information and flexibility is the ratio scale, which incorporates an absolute zero, preserves ratios, and permits the most sophisticated analysis. Measures such as lines of code or numbers of defects are ratio measures. It is for this scale that we can say that A is twice the size of B.
The importance of measurement type to software measurement rests in the types of calculations you can do with each scale. For example, you cannot compute a meaningful mean and standard deviation for a nominal scale; such calculations require an interval or ratio scale. Thus, unless we are aware of the scale types we use, we are likely to misuse the data we collect.
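As a rough illustration of how scale type constrains analysis, the sketch below refuses to compute a statistic that is not meaningful for the declared scale. The `PERMITTED` table is our own simplified reading of the usual measurement-theory guidance (mode for nominal data, median for ordinal, mean for interval and ratio); the names and sample data are hypothetical.

```python
from statistics import mean, median_low, mode

# Statistics treated as meaningful at each scale type (a simplified assumption).
STATISTICS = {"mode": mode, "median": median_low, "mean": mean}
PERMITTED = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}


def summarize(values, scale, statistic):
    """Compute a statistic only if it is meaningful for the given scale."""
    if statistic not in PERMITTED[scale]:
        raise ValueError(f"{statistic} is not meaningful on a {scale} scale")
    return STATISTICS[statistic](values)


def celsius_to_kelvin(celsius):
    """Kelvin has an absolute zero, so ratios of Kelvin values are meaningful."""
    return celsius + 273.15


languages = ["Ada", "Cobol", "Ada", "C++"]       # nominal
severities = [1, 2, 2, 3]                        # ordinal: 1=minor, 2=major, 3=catastrophic
print(summarize(languages, "nominal", "mode"))   # Ada
# median_low returns an actual data value, so no averaging of ranks is needed.
print(summarize(severities, "ordinal", "median"))  # 2

# 30 / 15 is 2.0 on the Celsius scale, but the same temperatures on a
# ratio scale differ by only about 5 percent.
print(celsius_to_kelvin(30) / celsius_to_kelvin(15))
```

Calling `summarize(languages, "nominal", "mean")` raises an error, which is the kind of check a measurement tool could apply before reporting an average.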
Researchers such as Norman Fenton and Horst Zuse have worked extensively in applying measurement theory to proposed software metrics. Among the ongoing questions is whether popular metrics such as function points are meaningful, in that they include unacceptable computations for their scale types. There are also questions about what entity function points measure.
Validation. We validate measures so we can be sure that the metrics we use are actually measuring what they claim to measure. For example, Tom McCabe proposed that we use cyclomatic number, a property of a program’s control-flow graph, as a measure of testing complexity. Many researchers are careful to state that cyclomatic number is a measure of structural complexity, but it does not capture all aspects of the difficulty we have in understanding a program. Other examples include Ross Jeffery’s study of programs from a psychological perspective, which applies notions about how much we can track and absorb as a way of measuring code complexity. Maurice Halstead claimed that his work, too, had psychological underpinnings, but the psychological basis for Halstead’s “software science” measures has been soundly debunked by Neil Coulter.5 (Bill Curtis and his colleagues at General Electric found, however, that Halstead’s count of operators and operands is a useful measure of program size.6)
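For readers unfamiliar with the cyclomatic number, it is usually computed from a control-flow graph as V(G) = E − N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components. The sketch below applies that formula to a small, hypothetical graph for a single loop; the node names are invented for illustration.

```python
def cyclomatic_number(nodes, edges, components=1):
    """Return V(G) = E - N + 2P for a control-flow graph."""
    return len(edges) - len(nodes) + 2 * components


# Hypothetical control-flow graph of one loop:
# entry -> test, test -> body, body -> test, test -> exit
nodes = {"entry", "test", "body", "exit"}
edges = {("entry", "test"), ("test", "body"), ("body", "test"), ("test", "exit")}

print(cyclomatic_number(nodes, edges))  # 4 - 4 + 2 = 2
```

The result counts linearly independent paths through the graph. As the text notes, that is a structural property and should not be read as a complete measure of how hard the program is to understand.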
We say that a measure is valid if it satisfies the representation condition: if it captures in the mathematical world the behavior we perceive in the empirical world. For example, we must show that if H is a measure of height, and if A is taller than B, then H(A) is larger than H(B). But such a proof must by its nature be empirical and it is often difficult to demonstrate. In these cases, we must consider whether we are measuring something with a direct measure (such as size) or an indirect measure (such as using the number of decision points as a measure of size) and what entity and attribute are being addressed.
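The height example can be turned into a small empirical check. This is a minimal sketch, assuming the empirical relation "taller than" is available as a set of observed pairs; the people, heights, and shoe sizes are invented, and the function name is hypothetical.

```python
from itertools import permutations


def satisfies_representation_condition(entities, taller_than, measure):
    """Check that measure(a) > measure(b) whenever a is observed to be taller than b."""
    return all(
        measure(a) > measure(b)
        for a, b in permutations(entities, 2)
        if taller_than(a, b)
    )


# Invented empirical observations: (a, b) means "a is taller than b".
observed = {("Ann", "Bob"), ("Bob", "Cy"), ("Ann", "Cy")}
height_cm = {"Ann": 180, "Bob": 170, "Cy": 160}
shoe_size = {"Ann": 40, "Bob": 42, "Cy": 38}
people = list(height_cm)


def taller_than(a, b):
    return (a, b) in observed


print(satisfies_representation_condition(people, taller_than, height_cm.get))  # True
print(satisfies_representation_condition(people, taller_than, shoe_size.get))  # False
```

A check like this can only refute a candidate measure on the observations at hand; it cannot prove validity in general, which is why the text calls such proofs empirical and difficult.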
Several attempts have been made to list a set of rules for validation. Elaine Weyuker suggested rules for validating complexity,7 and Austin Melton and his colleagues have proffered a similar, general list for the behavior of all metrics.8 However, each of these frameworks has been criticized and there is not yet a standard, accepted way of validating a measure.
The notion of validity is not specific to software engineering, and general concepts that we rarely consider, such as construct validity and predictive validity, should be part of any discussion of software engineering measurement. For example, Kitchenham, Pfleeger, and Fenton have proposed a general framework for validating software engineering metrics based on measurement theory and statistical rules.9
Apples and oranges. Measurement theory and validation should not distract us from the considerable difficulty of measuring software in the field. A major difficulty is that we often try to relate measures of a physical object (the software) with human and organizational behaviors, which do not follow physical laws.
Consider, for example, the capability maturity level as a measure. The maturity level reflects an organization’s software development practices and is purported to predict an organization’s ability to produce high-quality software on time. But even if an organization at level 2 can be determined, through extensive experimentation, to produce better software (measured by fewer delivered defects) than a level 1 organization, it doesn’t hold that all level 2 organizations develop software better than level 1 organizations. Some researchers welcome the use of capability maturity level as a predictor of the likelihood (but not a guarantee) that a level n organization will be better than a level n−1. But others insist that, for CMM level to be a measure in the measurement theory sense, level n must always be better than level n−1.
Still, a measure can be useful as a predictor without being valid in the sense of measurement theory. Moreover, we can gather valuable information by applying, even to heuristics, the standard techniques used in other scientific disciplines to assess association by analyzing distributions. But let’s complicate this picture further. Suppose we compare a level 3 organization that is constantly developing different and challenging avionics systems with a level 2 organization that develops versions of a relatively simple Cobol business application. Obviously, we are comparing sliced apples with peeled oranges, and the domain, customer type, and many other factors moderate the relationships we observe.
This situation reveals problems not with the CMM as a measure, but with the model on which the CMM is based. We begin with simple models that provide useful information. Sometimes those models are sufficient for our needs, but other times we must extend the simple models in order to handle more complex situations. Again, this approach is no different from other sciences, where simple models (of molecular structure, for instance) are expanded as scientists learn more about the factors that affect the outcomes of the processes they study.
State of the gap. In general, measurement theory is getting a great deal of attention from researchers but is being ignored by practitioners and customers, who rely on empirical evidence of a metric’s utility regardless of its scientific grounding.
Researchers should work closely with practitioners to understand the valid uses and interpretations of a software measure based on its measurement-theoretic attributes. They should also consider model validity separate from measurement validity, and develop more accurate models on which to base better measures. Finally, there is much work to be done to complete a framework for measurement validation, as well as to achieve consensus within the research community on the framework’s accuracy and usefulness.
MEASUREMENT MODELS
A measurement makes sense only when it is associated with one or more models. One essential model tells us the domain and range of the measure mapping; that is, it describes the entity and attribute being measured, the set of possible resulting measures, and the relationships among several measures (such as productivity is equal to size produced per unit of effort). Models also distinguish prediction from assessment; we must know whether we are using the measure to estimate future characteristics from previous ones (such as effort, schedule, or reliability estimation) or determining the current condition of a process, product, or resource (such as assessing defect density or testing effectiveness).
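The difference between assessment and prediction can be made concrete with the productivity relationship mentioned above. The sketch below is illustrative only: the units (lines of code, person-months, defects per thousand lines) and all numbers are assumptions, and predicting effort by simple proportion is a deliberately naive model.

```python
def productivity(size, effort):
    """Size produced per unit of effort, e.g. lines of code per person-month."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return size / effort


def defect_density(defects, size_kloc):
    """Defects per thousand lines of code (one common definition, assumed here)."""
    if size_kloc <= 0:
        raise ValueError("size must be positive")
    return defects / size_kloc


# Assessment: describe a completed (hypothetical) project.
past_productivity = productivity(size=12000, effort=30)  # 400.0 LOC per person-month
past_density = defect_density(defects=36, size_kloc=12)  # 3.0 defects per KLOC

# Prediction: estimate effort for a new project from the past figure.
estimated_effort = 20000 / past_productivity             # 50.0 person-months
print(past_productivity, past_density, estimated_effort)
```

The same number serves both roles, but the prediction holds only to the extent that the underlying model (here, constant productivity across projects) holds.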
There are also models to guide us in deriving and applying measurement. A commonly used model of this type is the Goal-Question-Metric paradigm suggested by Vic Basili and David Weiss (and later expanded by Basili and Dieter Rombach).10 This approach uses templates to help prospective users derive measures from their goals and the questions they must answer during development. The template encourages the user to express goals in the following form:
Analyze the [object] for the purpose of [purpose] with respect to [focus] from the viewpoint of [viewpoint] in the [environment].
For example, an XYZ Corporation manager concerned about overrunning the project schedule might express the goal of “meeting schedules” as
Analyze the project for the purpose of control with respect to meeting schedules from the viewpoint of the project manager in the XYZ Corporation.
From each goal, the manager can derive questions whose answers will help determine whether the goal has been met. The questions derived suggest metrics that should be used to answer the questions. This top-down derivation assists managers and developers not only in knowing what data to collect but also in understanding the type of analysis needed when the data is in hand.
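The top-down derivation can be sketched as a small data structure that fills in the goal template and records which metrics each question calls for. This is our own illustrative encoding, not part of the published paradigm; the class name, the questions, and the metrics are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    obj: str
    purpose: str
    focus: str
    viewpoint: str
    environment: str
    # Maps each derived question to the metrics that help answer it.
    questions: dict = field(default_factory=dict)

    def statement(self):
        """Fill in the goal template."""
        return (
            f"Analyze the {self.obj} for the purpose of {self.purpose} "
            f"with respect to {self.focus} from the viewpoint of "
            f"{self.viewpoint} in the {self.environment}."
        )

    def metrics(self):
        """List the distinct metrics to collect, in first-use order."""
        collected = []
        for metric_list in self.questions.values():
            for metric in metric_list:
                if metric not in collected:
                    collected.append(metric)
        return collected


goal = Goal(
    obj="project",
    purpose="control",
    focus="meeting schedules",
    viewpoint="the project manager",
    environment="XYZ Corporation",
    questions={  # hypothetical questions and metrics
        "Is the project on schedule?": ["planned vs. actual milestone dates"],
        "How much work remains?": ["tasks completed", "planned vs. actual milestone dates"],
    },
)
print(goal.statement())
print(goal.metrics())
```

Because every metric is reachable only through a question and a goal, data that answers no question is never collected, which is the main practical benefit of the top-down approach.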
Some practitioners, such as Bill Hetzel, encourage a bottom-up approach to metrics application, where organizations measure what is available, regardless of goals.11 Other models include Ray Offen and Jeffery’s M3P model, derived from business goals and described on page 45, and the combination of goal-question-metric and capability maturity built into the European Community’s ami project framework.
Model research. Experimentation models of measurement are essential for case studies and experiments for software engineering research. For example, an organization may build software using two different techniques: one a formal method, another not. Researchers would then evaluate the resulting software to see if one method produced higher quality software than the other.
An experimentation model describes the hypothesis being tested, the factors that can affect the outcome, the degree of control over each factor, the relationships among the factors, and the plan for performing the research and evaluating the outcome. To address the lack of rigor in software experimentation, projects such as the UK’s Desmet (reported upon extensively in ACM Software Engineering Notes beginning in October of 1994) have produced guidelines to help software engineers design surveys, case studies, and experiments.
Model future. As software engineers, we tend to neglect models. In other scientific disciplines, models act to unify and explain, placing apparently disjoint events in a larger, more understandable framework. The lack of models in software engineering is symptomatic of a much larger problem: a lack of systems focus. Few software engineers understand the need to define a system boundary or explain how one system interacts with another. Thus, research and practice have a very long way to go in exploring and exploiting what models can do to improve software products and processes.
MEASURING THE PROCESS
For many years, computer scientists and software engineers focused on measuring and understanding code. In recent years, as we have come to understand that product quality is evidence of process success, software process issues have received much attention. Process measures include large-grain quantifications, such as the CMM scale, as well as smaller-grain evaluations of particular process activities, such as test effectiveness.
Process perspective. Process research can be viewed from several perspectives. Some process researchers develop process description languages, such as the work done on the Alf (Esprit) project. Here, measurement supports the description by counting tokens that indicate process size and complexity. Other researchers investigate the actual process that developers use to build software. For example, early work by Curtis and his colleagues at MCC revealed that the way we analyze and design software is typically more iterative and complex than top-down.12
MORE INFORMATION ABOUT MEASUREMENT

Books
♦ D. Card and R. Glass, Measuring Software Design Complexity, Prentice Hall, Englewood Cliffs, N.J., 1991.
♦ S.D. Conte, H.E. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models, Benjamin Cummings, Menlo Park, Calif., 1986.
♦ T. DeMarco, Controlling Software Projects, Dorset House, New York, 1982.
♦ J.B. Dreger, Function Point Analysis, Prentice Hall, Englewood Cliffs, N.J., 1989.
♦ N. Fenton and S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, second edition, International Thomson Press, London, 1996.
♦ R.B. Grady, Practical Software Metrics for Project Management and Process Improvement, Prentice Hall, Englewood Cliffs, N.J., 1992.
♦ R.B. Grady and D.L. Caswell, Software Metrics: Establishing a Company-Wide Program, Prentice Hall, Englewood Cliffs, N.J., 1987.
♦ T.C. Jones, Applied Software Measurement: Assuring Productivity and Quality, McGraw Hill, New York, 1992.
♦ K. Moeller and D.J. Paulish, Software Metrics: A Practitioner’s Guide to Improved Product Development, IEEE Computer Society Press, Los Alamitos, Calif., 1993.
♦ P. Oman and S.L. Pfleeger, Applying Software Metrics, IEEE Computer Society Press, Los Alamitos, Calif., 1996.

Journals
♦ IEEE Software (Mar. 1991 and July 1994, special issues on measurement; January 1996, special issue on software quality)
♦ Computer (September 1994, special issue on product metrics)
♦ IEEE Transactions on Software Engineering
♦ Journal of Systems and Software
♦ Software Quality Journal
♦ IEE Journal
♦ IBM Systems Journal
♦ Information and Software Technology
♦ Empirical Software Engineering: An International Journal

Key Journal Articles
♦ V.R. Basili and H.D. Rombach, “The TAME Project: Towards Improvement-Oriented Software Environments,” IEEE Trans. Software Eng., Vol. 14, No. 6, 1988, pp. 758-773.
♦ C. Billings et al., “Journey to a Mature Software Process,” IBM Systems Journal, Vol. 33, No. 1, 1994, pp. 46-61.
♦ B. Curtis, “Measurement and Experimentation in Software Engineering,” Proc. IEEE, Vol. 68, No. 9, 1980, pp. 1144-1157.
♦ B. Kitchenham, L. Pickard, and S.L. Pfleeger, “Using Case Studies for Process Improvement,” IEEE Software, July 1995.
♦ S.L. Pfleeger, “Experimentation in Software Engineering,” Annals Software Eng., Vol. 1, No. 1, 1995.
♦ S.S. Stevens, “On the Theory of Scales of Measurement,” Science, No. 103, 1946, pp. 677-680.

Conferences
♦ Applications of Software Measurement, sponsored by Software Quality Engineering, held annually in Florida and California on alternate years. Contact: Bill Hetzel, SQE, Jacksonville, FL, USA.
♦ International Symposium on Software Measurement, sponsored by IEEE Computer Society (1st in Baltimore, 1993, 2nd in London, 1994, 3rd in Berlin, 1996; 4th is upcoming in Nov. 1997 in Albuquerque, New Mexico); proceedings available from IEEE Computer Society Press. Contact: Jim Bieman, Colorado State University, Fort Collins, CO, USA.
♦ Oregon Workshop on Software Metrics, sponsored by Portland State University, held annually near Portland, Oregon. Contact: Warren Harrison, PSU, Portland, OR, USA.
♦ Minnowbrook Workshop on Software Performance Evaluation, sponsored by Syracuse University, held each summer at Blue Mountain Lake, NY. Contact: Amrit Goel, Syracuse University, Syracuse, NY, USA.
♦ NASA Software Engineering Symposium, sponsored by NASA Goddard Space Flight Center, held annually at the end of November or early December in Greenbelt, Maryland; proceedings available. Contact: Frank McGarry, Computer Sciences Corporation, Greenbelt, MD, USA.
♦ CSR Annual Workshop, sponsored by the Centre for Software Reliability, held annually at locations throughout Europe; proceedings available. Contact: Bev Littlewood, Centre for Software Reliability, City University, London, UK.

Organizations
♦ Australian Software Metrics Association. Contact: Mike Berry, School of Information Systems, University of New South Wales, Sydney 2052, Australia.
♦ Quantitative Methods Committee, IEEE Computer Society Technical Council on Software Engineering. Contact: Jim Bieman, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
♦ Centre for Software Reliability. Contact: Bev Littlewood, CSR, City University, London, UK.
♦ Software Engineering Laboratory. Contact: Vic Basili, Department of Computer Science, University of Maryland, College Park, MD, USA.
♦ SEI Software Measurement Program. Contact: Anita Carleton, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
♦ Applications of Measurement in Industry (ami) User Group. Contact: Alison Rowe, South Bank University, London, UK.
♦ International Society of Parametric Analysts. Contact: J. Clyde Perry & Associates, PO Box 6402, Chesterfield, MO 63006-6402, USA.
♦ International Function Point Users Group. Contact: IFPUG Executive Office, Blendonview Office Park, 5008-28 Pine Creek Drive, Westerville, OH 43081-4899, USA.

Researchers also use measurement to help them understand and improve existing processes. A good example is an ICSE 1994 report in which Larry Votta, Adam Porter, and Basili reported that scenario-based inspections (where each inspector looked for a particular type of defect) produced better results than ad hoc or checklist-based inspections (where each inspector looked for any type of defect).13 Basili and his colleagues at the NASA Software Engineering Laboratory continue to use measurement to evaluate the impact of using Ada, cleanroom, and other technologies that change the software development process. Billings and his colleagues at Loral (formerly IBM) are also measuring their process for building space shuttle software.
Remeasuring. The reuse community provides many examples of process-related measurement as it tries to determine how reuse affects quality and productivity. For example, Wayne Lim has modeled the reuse process and suggested measurements for assessing reuse effectiveness.14 Similarly, Shari Lawrence Pfleeger and Mary Theofanos have combined process maturity concepts with a goal-question-metric approach to suggest metrics to instrument the reuse process.15
Reengineering also offers opportunities to measure process change and its effects. At the 1994 International Software Engineering Research Network meeting, an Italian research group reported on their evaluation of a large system reengineering project. In the project, researchers kept an extensive set of measurements to track the impact of the changes made as a banking application’s millions of lines of Cobol code were reengineered over a period of years. These measures included the system structure and the number of help and change requests. Measurement let the team evaluate the success and payback of the reengineering process.
Process problems. Use of these and other process models and measurements raises several problems. First, large-grained process measures require validation, which is difficult to do. Second, project managers are often intimidated by the effort required to track process measures throughout development. Individual process activities are usually easier to evaluate, as they are smaller and more controllable. Third, regardless of the granularity, process measures usually require an underlying model of how they interrelate; this model is usually missing from process understanding and evaluation, so the results of research are difficult to interpret. Thus, even as attention turns increasingly to process in the larger community, process measurement research and practice lag behind the use of other measurements.
MEASURING THE PRODUCTS
Because products are more concrete than processes and resources and are thus easier to measure, it is not surprising that most measurement work is directed in this area. Moreover, customers encourage product assessment because they are interested in the final product’s characteristics, regardless of the process that produced it. As a result, we measure defects (in specification, design, code, and test cases) and failures as part of a broader program to assess product quality. Quality frameworks, such as McCall’s or the proposed ISO 9126 standard, suggest ways to describe different aspects of product quality, such as distinguishing usability from reliability from maintainability.
Measuring risk. Because failures are the most visible evidence of poor quality, reliability assessment and prediction have received much attention. There are many reliability models, each focused on using operational profile and mean-time-to-failure data to predict when the next failure is likely to occur. These models are based on probability distributions, plus assumptions about whether new defects are introduced when old ones are repaired. However, more work is required both in making the assumptions realistic and in helping users select appropriate models. Some models are accurate most of the time, but there are no guarantees that a particular model will perform well in a particular situation.
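To show how strongly such models depend on their assumptions, the following sketch uses the simplest assumption available: a constant failure rate, so that times between failures follow an exponential distribution and no reliability growth occurs as defects are repaired. It is not any particular published reliability model, and the inter-failure times are invented.

```python
import math


def estimate_mttf(interfailure_times):
    """Estimate mean time to failure as the average observed time between failures."""
    if not interfailure_times:
        raise ValueError("at least one inter-failure time is required")
    return sum(interfailure_times) / len(interfailure_times)


def reliability(t, mttf):
    """Probability of running for time t without failure, assuming a constant failure rate."""
    return math.exp(-t / mttf)


hours_between_failures = [12.0, 30.0, 18.0, 40.0]  # invented data
mttf = estimate_mttf(hours_between_failures)       # 25.0 hours
print(mttf, reliability(10.0, mttf))               # chance of surviving a 10-hour run
```

If repairs actually remove defects, or introduce new ones, the constant-rate assumption fails and the prediction with it, which is why model selection matters.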
Most developers and customers do not want to wait until delivery to determine if the code is reliable or maintainable. As a result, some practitioners measure defects as evidence of code quality and likely reliability. Ed Adams of IBM showed the dangers of this approach. He used IBM operating system data to show that 80 percent of the reliability problems were caused by only 2 percent of the defects.16 Research must be done to determine which defects are likely to cause the most problems, as well as to prevent such problems before they occur.
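A team with failure reports traced back to defects can check for this kind of skew in its own data. The sketch below computes the share of failures attributable to the most troublesome fraction of defects; the function name is hypothetical and the counts are invented solely to reproduce the shape of the result described above, not taken from Adams's data.

```python
def failure_share_of_top_defects(failures_per_defect, top_fraction):
    """Return the fraction of all failures caused by the worst top_fraction of defects."""
    counts = sorted(failures_per_defect, reverse=True)
    top_count = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_count]) / sum(counts)


# Invented data: 50 defects, one of which accounts for most observed failures.
failures_per_defect = [80] + [1] * 20 + [0] * 29
print(failure_share_of_top_defects(failures_per_defect, 0.02))  # 0.8
```

When the share is this lopsided, a raw defect count says little about the reliability a user will experience, since most defects contribute almost nothing to observed failures.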
Early measurement. Earlier life-cycle products have also been the source of many measurements. Dolly Samson and Jim Palmer at George Mason University have produced tools that measure and evaluate the quality of informal, English-language requirements; these tools are being used by the US Federal Bureau of Investigation and the Federal Aviation Administration on projects where requirements quality is essential. Similar work has been pursued by Anthony Finkelstein’s and Alistair Sutcliffe’s research groups at City University in London. Suzanne Robertson and Shari Pfleeger are currently working with the UK Ministry of Defence to evaluate requirements structure as well as quality, so requirements volatility and likely reuse can be assessed. However, because serious measurement of requirements attributes is just beginning, very little requirements measurement is done in practice.
Design and code. Researchers and practitioners have several ways of evaluating design quality, in the hope that good design will yield good code. Sallie Henry and Dennis Kafura at Virginia Tech proposed a design measure based on the fan-in and fan-out of modules. David Card and Bill Agresti worked with NASA Goddard developers to derive a measure of software design complexity that predicts where code errors are likely to be. But many of the existing design measures focus on functional descriptions of design; Shyam Chidamber and Chris Kemerer at MIT have extended these types of measures to object-oriented design and code.
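To make the fan-in and fan-out idea concrete, here is a minimal sketch in Python. It assumes the commonly cited information-flow form, length × (fan-in × fan-out)², which the text above does not spell out; the `Module` class, the module names, and every number are hypothetical, and the flow counts are simply supplied by hand rather than extracted from a real design.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    length: int    # size of the module, for example in lines of code
    fan_in: int    # number of information flows into the module
    fan_out: int   # number of information flows out of the module


def information_flow_complexity(m: Module) -> int:
    """Commonly cited form (an assumption here): length * (fan_in * fan_out) ** 2."""
    return m.length * (m.fan_in * m.fan_out) ** 2


def rank_by_complexity(modules: list[Module]) -> list[tuple[str, int]]:
    """Return (name, complexity) pairs, most complex first."""
    scored = [(m.name, information_flow_complexity(m)) for m in modules]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical design: three modules with made-up sizes and flow counts.
design = [
    Module("parser", length=120, fan_in=2, fan_out=3),
    Module("scheduler", length=80, fan_in=5, fan_out=4),
    Module("logger", length=40, fan_in=9, fan_out=0),
]

for name, score in rank_by_complexity(design):
    print(f"{name:10s} {score}")
```

Note that a module with no outgoing flows scores zero however often it is used, which is one reason a measure like this should be checked against your own goals before it is adopted.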
The fact that code is easier to measure than earlier products does not prevent controversy. Debate continues to rage over whether lines of code are a reasonable measure of software size. Bob Park at SEI has produced a framework that organizes the many decisions involved in defining a lines-of-code count, including reuse, comments, executable statements, and more. His report makes clear that you must know your goals before you design your measures. Another camp measures code in terms of function points, claiming that such measures capture the size of functionality from the specification in a way that is impossible for lines of code. Both sides have valid points, and both have attempted to unify and define their ideas so that counting and comparing across organizations is possible. However, practitioners and customers have no time to wait for resolution. They need measures now that will help them understand and predict likely effort, quality, and schedule.
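A small sketch shows why the definition of a lines-of-code count matters. The Python function below is illustrative only and is not Park's framework: it counts the same text under three definitions we chose for the example, and it assumes (as a simplification) that a comment is any line beginning with a single prefix character.

```python
def count_lines(source: str, comment_prefix: str = "#") -> dict[str, int]:
    """Count one source text under three different size definitions."""
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Simplification: a line is a comment only if it starts with the prefix.
    non_comment = [
        ln for ln in non_blank if not ln.strip().startswith(comment_prefix)
    ]
    return {
        "physical": len(lines),
        "non_blank": len(non_blank),
        "non_blank_non_comment": len(non_comment),
    }


# Hypothetical sample program used only to exercise the counter.
SAMPLE = """\
# compute a total
total = 0

for x in [1, 2, 3]:
    # accumulate
    total += x
print(total)
"""

print(count_lines(SAMPLE))
```

The three counts for this seven-line sample are 7, 6, and 4, so two organizations that report "size" without stating their counting rules may not be comparing the same thing.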
Thus, as with other types of measurement, there is a large gap between the theory and practice of product measurement. The practitioners and customers know what they want, but the researchers have not yet been able to find measures that are practical, scientifically sound (according to measurement theory principles), and cost-effective to capture and analyze.
MEASURING THE RESOURCES
For many years, some of our most insightful software engineers (including Jerry Weinberg, Tom DeMarco, Tim Lister, and Bill Curtis) have encouraged us to look at the quality and variability of the people we employ for the source of product variations. Some initial measurement work has been done in this area.
DeMarco and Lister report in Peopleware on an IBM study which showed that your surroundings (such as noise level, number of interruptions, and office size) can affect the productivity and quality of your work. Likewise, a study by Basili and David Hutchens suggests that individual variation accounts for much of the difference in code complexity;17 these results support a 1979 study by Sylvia Sheppard and her colleagues at ITT, showing that the average time to locate a defect in code is not related to years of experience but rather to breadth of experience. However, there is relatively little attention being paid to human resource measurement, as developers and managers find it threatening.
Nonhuman resources. More attention has been paid to other resources: budget and schedule assessment, and effort, cost, and schedule prediction. A rich assortment of tools and techniques is available to support this work, including Barry Boehm’s Cocomo model, tools based on Albrecht’s function points model, Larry Putnam’s Slim model, and others. However, no model works satisfactorily for everyone, in part because of organizational and project differences, and in part because of model imperfections. June Verner and Graham Tate demonstrated how tailoring models can improve their performance significantly. Their 4GL modification of an approach similar to function points was quite accurate compared to several other alternatives.18,19 Barbara Kitchenham’s work on the Mermaid Esprit project demonstrated how several modeling approaches can be combined into a larger model that is more accurate than any of its component models.20 And Boehm is updating and improving his Cocomo model to reflect advances in measurement and process understanding, with the hope of increasing its accuracy.21
Shaky models. The state of the practice in resource measurement lags far behind the research. Many of the research models are used once, publicized, and then die. Those models that are used in practice are often implemented without regard to the underlying theory on which they are built. For example, many practitioners implement Boehm’s Cocomo model, using not only his general approach but also his cost factors (described in his 1981 book, Software Engineering Economics). However, Boehm’s cost factor values are based on TRW data captured in the 1970s and are irrelevant to other environments, especially given the radical change in development techniques and tools since Cocomo was developed. Likewise, practitioners adopt the equations and models produced by Basili’s Software Engineering Laboratory, even though the relationships are derived from NASA data and are not likely to work in other environments. The research community must better communicate to practitioners that it is the techniques that are transferable, not the data and equations themselves.
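The difference between borrowing a technique and borrowing its numbers can be sketched in a few lines of Python. The example assumes the general power-law shape used by Cocomo-style models, effort = a × size^b, and fits a and b to an organization's own completed projects by least squares on the logarithms. The project history is invented for illustration, and the sketch is not Boehm's calibration procedure or anyone's published tool.

```python
import math


def fit_power_law(sizes, efforts):
    """Least-squares fit of effort = a * size ** b, done in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_effort(a, b, size):
    """Apply the locally calibrated model to a new project size."""
    return a * size ** b


# Hypothetical local history: (size in KLOC, effort in person-months).
history = [(10, 30.0), (20, 66.0), (40, 140.0), (80, 310.0)]
a, b = fit_power_law([s for s, _ in history], [e for _, e in history])
print(f"local calibration: effort = {a:.2f} * KLOC^{b:.2f}")
print(f"predicted effort for 50 KLOC: {predict_effort(a, b, 50):.1f} person-months")
```

The fitting technique travels from one organization to the next; the fitted values of a and b do not, and they should be refitted as local data accumulates.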
STORING, ANALYZING AND REPORTING THE MEASUREMENTS
Researchers and practitioners alike often assume that once they choose the metrics and collect the data, their measurement activities are done. But the goals of measurement (understanding and change) are not met until they analyze the data and implement change.
Measuring tools. In the UK, a team led by Kitchenham has developed a tool that helps practitioners choose metrics and builds a repository for the collected data. Amadeus, an American project funded by the Advanced Research Projects Agency, has some of the same goals; it monitors the development process and stores the data for later analysis. Some Esprit projects are working to combine research tools into powerful analysis engines that will help developers manipulate data for decision making. For example, Cap Volmac in the Netherlands is leading the Squid project to build a comprehensive software quality assessment tool.
It is not always necessary to use sophisticated tools for metrics collection and storage, especially on projects just beginning to use metrics. Many practitioners use spreadsheet software, database management systems, or off-the-shelf statistical packages to store and analyze data. The choice of tool depends on how you will use the measurements. For many organizations, simple analysis techniques such as scatter charts and histograms provide useful information about what is happening on a project or in a product. Others prefer to use statistical analysis, such as regression and correlation, box plots, and measures of central tendency and dispersion. More complex still are classification trees, applied by Adam Porter and Rick Selby to determine which metrics best predict quality or productivity. For example, if module quality can be assessed using the number of defects per module, then a classification tree illustrates which of the metrics collected predicts modules likely to have more than a threshold number of defects.22
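The classification-tree idea can be illustrated with its simplest possible case: a single split. The Python sketch below is not Porter and Selby's algorithm. It assumes a one-level tree that scores each candidate metric by how few modules its best "value above threshold" rule misclassifies, and the metric names and module data are hypothetical.

```python
def best_split(values, labels):
    """Best threshold for one metric: predict 'high-defect' when value > threshold.

    Returns (errors, threshold), where errors counts misclassified modules.
    Only the 'greater than' direction is tried, to keep the sketch short.
    """
    best = (len(labels) + 1, None)
    for threshold in sorted(set(values)):
        errors = sum((v > threshold) != label for v, label in zip(values, labels))
        if errors < best[0]:
            best = (errors, threshold)
    return best


def most_predictive_metric(modules, metric_names, defect_threshold):
    """Pick the metric whose single split best identifies high-defect modules."""
    labels = [m["defects"] > defect_threshold for m in modules]
    results = {}
    for name in metric_names:
        values = [m[name] for m in modules]
        results[name] = best_split(values, labels)
    winner = min(results, key=lambda name: results[name][0])
    return winner, results


# Hypothetical module data: two candidate metrics plus observed defects.
modules = [
    {"loc": 120, "fan_out": 2, "defects": 1},
    {"loc": 300, "fan_out": 9, "defects": 7},
    {"loc": 450, "fan_out": 3, "defects": 2},
    {"loc": 200, "fan_out": 8, "defects": 6},
    {"loc": 500, "fan_out": 10, "defects": 9},
    {"loc": 150, "fan_out": 1, "defects": 0},
]

winner, results = most_predictive_metric(modules, ["loc", "fan_out"], defect_threshold=5)
print("most predictive metric:", winner)
for name, (errors, threshold) in results.items():
    print(f"  {name}: split at > {threshold}, {errors} misclassified")
```

A full tree would repeat this selection within each resulting subset; with so few invented data points, the output says nothing about which metrics predict defects in practice.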
Process measures are more difficult to track, as they often require traceability from one product or activity to another. In this case, databases of traceability information are needed, coupled with software to track and analyze progress. Practitioners often use their configuration management system for these measures, augmenting existing configuration information with measurement data.
Analyzing tools. For storing and analyzing large data sets, it is important to choose appropriate analysis techniques. Population dynamics and distribution are key aspects of this choice. When sampling from data, it is essential that the sample be representative so that your judgments about the sample apply to the larger population. It is equally important to ensure that your analysis technique is suitable for the data’s distribution. Often, practitioners use a technique simply because it is available on their statistical software packages, regardless of whether the data is distributed normally or not. As a result, invalid parametric techniques are used instead of the more appropriate nonparametric ones. Many of the parametric techniques are robust enough to be used with nonnormal distributions, but you must verify this robustness.
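As a small illustration of the parametric versus nonparametric choice, the Python sketch below computes two correlation coefficients for the same data: Pearson's (parametric) and Spearman's rank correlation (nonparametric, computed here as Pearson's coefficient on the ranks). The data set is invented; it contains one very large module so that the size distribution is far from normal.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation (parametric)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation (nonparametric): Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: module size (LOC) and defect count, with one huge module.
sizes = [10, 20, 30, 40, 50, 2000]
defects = [1, 2, 3, 4, 5, 6]

print(f"Pearson:  {pearson(sizes, defects):.3f}")
print(f"Spearman: {spearman(sizes, defects):.3f}")
```

Defects rise steadily with size in this invented sample, so the rank-based coefficient is 1, while the single outlier pulls the parametric coefficient well below that. Which figure is appropriate depends on the distribution and on the question being asked.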
Applying the appropriate statistical techniques to the measurement scale is even more important. Measures of central tendency and dispersion differ with the scale, as do appropriate transformations. You can use mode and frequency distributions to analyze nominal data that describes categories, but you cannot use means and standard deviations. With ordinal data, where an order is imposed on the categories, you can use medians, maxima, and minima for analysis. But you can use means, standard deviations, and more sophisticated statistics only when you have interval or ratio data.
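The scale rules above can be encoded directly, so that a summary routine refuses to report statistics the scale does not support. The following Python sketch uses only the standard library; the function name `describe`, the sample data, and the choice to code ordinal severity levels as integers are assumptions made for the example.

```python
import statistics
from collections import Counter


def describe(values, scale):
    """Summarize data using only statistics meaningful for its measurement scale."""
    if scale not in ("nominal", "ordinal", "interval", "ratio"):
        raise ValueError(f"unknown scale: {scale}")
    # Every scale supports the mode and a frequency distribution.
    summary = {
        "mode": statistics.mode(values),
        "frequencies": dict(Counter(values)),
    }
    if scale == "nominal":
        return summary
    # Ordinal and above: order is meaningful, so median, min, and max apply.
    # median_low always returns a value that occurs in the data.
    summary["median"] = statistics.median_low(values)
    summary["min"] = min(values)
    summary["max"] = max(values)
    if scale == "ordinal":
        return summary
    # Interval and ratio: distances are meaningful, so mean and stdev apply.
    summary["mean"] = statistics.mean(values)
    summary["stdev"] = statistics.stdev(values)
    return summary


# Hypothetical samples, one per kind of scale.
defect_types = ["logic", "interface", "logic", "data"]   # nominal
severity = [1, 3, 2, 3, 5]                                # ordinal, coded 1..5
effort_hours = [4.0, 6.0, 8.0]                            # ratio

print(describe(defect_types, "nominal"))
print(describe(severity, "ordinal"))
print(describe(effort_hours, "ratio"))
```

Coding severity as integers makes it tempting to average the codes; the function deliberately omits the mean for ordinal data to guard against that.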
Presentation. Presenting measurement data so that customers can understand it is problematic because metrics are chosen based on business and development goals and the data is collected by developers. Typically, customers are not experts in software engineering; they want a “big picture” of what the software is like, not a large vector of measures of different aspects. Hewlett-Packard has been successful in using Kiviat diagrams (sometimes called radar graphs) to depict multiple measures in one picture, without losing the integrity of the individual measures. Similarly, Contel used multiple metrics graphs to report on software switch quality and other characteristics.
Measurement revisited. A relatively new area of research is the packaging of previous experience for use by new development and maintenance projects. Since many organizations produce new software that is similar to their old software or developed using similar techniques, they can save time and money by capturing experience for reuse at a later time. This reuse involves not only code but also requirements, designs, test cases, and more. For example, as part of its Experience Factory effort, the SEL is producing a set of documents that suggests how to introduce techniques and establish metrics programs. Guillermo Arango’s group at Schlumberger has automated this experience capture in a series of “project books” that let developers call up requirements, design decisions, measurements, code, and documents of all kinds to assist in building the next version of the same or similar product.23
Refining the focus. In the past, measurement research has focused on metric definition, choice, and data collection. As part of a larger effort to examine the scientific bases for software engineering research, attention is now turning to data analysis and reporting.
Practitioners continue to use what is readily available and easy to use, regardless of its appropriateness. This is in part the fault of researchers, who have not described the limitations of and constraints on techniques put forth for practical use.
Finally, the measurement community has yet to deal with the more global issue of technology transfer. It is unreasonable for us to expect practitioners to become experts in statistics, probability, or measurement theory, or even in the intricacies of calculating code complexity or modeling parameters. Instead, we need to encourage researchers to fashion results into tools and techniques that practitioners can easily understand and apply.
Just as we preach the need for measurement goals, so too must we base our activities on customer goals. As practitioners and customers cry out for measures early in the development cycle, we must focus our efforts on measuring aspects of requirements analysis and design. As our customers request measurements for evaluating commercial off-the-shelf software, we must provide product metrics that support such purchasing decisions. And as our customers insist on higher levels of reliability, functionality, usability, reusability, and maintainability, we must work closely with the rest of the software engineering community to understand the processes and resources that contribute to good products.
We should not take the gap between measurement research and practice lightly. During an open-mike session at the metrics symposium, a statistician warned us not to become like the statistics community, which he characterized as a group living in its own world with theories and results that are divorced from reality and useless to those who must analyze and understand them. If the measurement community remains separate from mainstream software engineering, our delivered code will be good in theory but not in practice, and developers will be less likely to take the time to measure even when we produce metrics that are easy to use and effective.
March/April 1997
REFERENCES
1. E.F. Weller, “Using Metrics to Manage Software Projects,” Computer, Sept. 1994, pp. 27-34.
2. W.C. Lim, “Effects of Reuse on Quality, Productivity, and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
3. M.K. Daskalantonakis, “A Practical View of Software Measurement and Implementation Experiences within Motorola,” IEEE Trans. Software Eng., Vol. 18, No. 11, 1992, pp. 998-1010.
4. W.M. Evangelist, “Software Complexity Metric Sensitivity to Program Restructuring Rules,” J. Systems Software, Vol. 3, 1983, pp. 231-243.
5. N. Coulter, “Software Science and Cognitive Psychology,” IEEE Trans. Software Eng., Vol. 9, No. 2, 1983, pp. 166-171.
6. B. Curtis et al., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics,” IEEE Trans. Software Eng., Vol. 5, No. 2, 1979, pp. 96-104.
7. E.J. Weyuker, “Evaluating Software Complexity Measures,” IEEE Trans. Software Eng., Vol. 14, No. 9, 1988, pp. 1357-1365.
8. A.C. Melton et al., “Mathematical Perspective of Software Measures Research,” Software Eng. J., Vol. 5, No. 5, 1990, pp. 246-254.
9. B. Kitchenham, S.L. Pfleeger, and N. Fenton, “Toward a Framework for Measurement Validation,” IEEE Trans. Software Eng., Vol. 21, No. 12, 1995, pp. 929-944.
10. V.R. Basili and D. Weiss, “A Methodology For Collecting Valid Software Engineering Data,” IEEE Trans. Software Eng., Vol. 10, No. 3, 1984, pp. 728-738.
11. W. Hetzel, Making Software Measurement Work: Building an Effective Software Measurement Program, QED Publishing, Boston, 1993.
12. B. Curtis, H. Krasner, and N. Iscoe, “A Field Study of the Software Design Process for Large Systems,” Comm. ACM, Nov. 1988, pp. 1268-1287.
13. A.A. Porter, L.G. Votta, and V.R. Basili, “An Experiment to Assess Different Defect Detection Methods for Software Requirements Inspections,” Proc. 16th Int’l Conf. Software Eng., 1994, pp. 103-112.
14. W.C. Lim, “Effects of Reuse on Quality, Productivity and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
15. M.F. Theofanos and S.L. Pfleeger, “A Framework for Creating a Reuse Measurement Plan,” tech. report, 1231/D2, Martin Marietta Energy Systems, Data Systems Research and Development Division, Oak Ridge, Tenn., 1993.
16. E. Adams, “Optimizing Preventive Service of Software Products,” IBM J. Research and Development, Vol. 28, No. 1, 1984, pp. 2-14.
17. V.R. Basili and D.H. Hutchens, “An Empirical Study of a Syntactic Complexity Family,” IEEE Trans. Software Eng., Vol. 9, No. 6, 1983, pp. 652-663.
18. J.M. Verner and G. Tate, “Estimating Size and Effort in Fourth-Generation Language Development,” IEEE Software, July 1988, pp. 173-177.
19. J. Verner and G. Tate, “A Software Size Model,” IEEE Trans. Software Eng., Vol. 18, No. 4, 1992, pp. 265-278.
20. B.A. Kitchenham, P.A.M. Kok, and J. Kirakowski, “The Mermaid Approach to Software Cost Estimation,” Proc. Esprit, Kluwer Academic Publishers, Dordrecht, the Netherlands, 1990, pp. 296-314.
21. B.W. Boehm et al., “Cost Models for Future Life Cycle Processes: COCOMO 2.0,” Annals Software Eng., Nov. 1995, pp. 1-24.
22. A. Porter and R. Selby, “Empirically Guided Software Development Using Metric-Based Classification Trees,” IEEE Software, Mar. 1990, pp. 46-54.
23. G. Arango, E. Schoen, and R. Pettengill, “Design as Evolution and Reuse,” in Advances in Software Reuse, IEEE Computer Society Press, Los Alamitos, Calif., March 1993, pp. 9-18.
Shari Lawrence Pfleeger is director of the Center for Research in Evaluating Software Technology (CREST) at Howard University in Washington, DC. The Center establishes partnerships with industry and government to evaluate the effectiveness of software engineering techniques and tools. She is also president of Systems/Software Inc., a consultancy specializing in software engineering and technology evaluation. Pfleeger is the author of several textbooks and dozens of articles on software engineering and measurement. She is an associate editor-in-chief of IEEE Software and is an advisor to IEEE Spectrum. Pfleeger is a member of the executive committee of the IEEE Technical Council on Software Engineering, and the program cochair of the next International Symposium on Software Metrics in Albuquerque, New Mexico.
Pfleeger received a PhD in information technology and engineering from George Mason University. She is a member of the IEEE and ACM.
Ross Jeffery is a professor of information systems and director of the Centre for Advanced Empirical Software Research at the University of New South Wales, Australia. His research interests include software engineering process and product modeling and improvement, software metrics, software technical and management reviews, and software resource modeling. He is on the editorial board of the IEEE Transactions on Software Engineering, the Journal of Empirical Software Engineering, and the editorial board of the Wiley International Series on information systems. He is also a founding member of the International Software Engineering Research Network.
Bill Curtis is co-founder and chief scientist of TeraQuest Metrics in Austin, Texas, where he works with organizations to increase their software development capability. He is a former director of the Software Process Program in the Software Engineering Institute at Carnegie Mellon University, where he is a visiting scientist. Prior to joining the SEI, Curtis worked at MCC, ITT's Programming Technology Center, GE's Space Division, and the University of Washington. He was a founding faculty member of the Software Quality Institute at the University of Texas.
He is co-author of the Capability Maturity Model for software and the principal author of the People CMM. He is on the editorial boards of seven technical journals and has published more than 100 technical articles on software engineering, user interface, and management.
Barbara Kitchenham is a principal researcher in software engineering at Keele University. Her main interest is in software metrics and their support for project and quality management. She has written more than 40 articles on the topic as well as the book Software Metrics: Measurement for Software Process Improvement. She spent 10 years working for ICL and STC, followed by two years at City University and seven years at the UK National Computing Centre, before joining Keele in February 1996.
Kitchenham received a PhD from Leeds University.
Address questions about this article to Pfleeger at CREST, Howard University Department of Systems and Computer Science, Washington, DC 20059; [email protected].