
Teaching Software Metrology: The Science of Measurement for Software Engineering

Handbook on Teaching Empirical Software Engineering

Abstract

While the methodological rigor of computing research has improved considerably in the past two decades, quantitative software engineering research is hampered by immature measures and inattention to theory. Measurement—the principled assignment of numbers to phenomena—is intrinsically difficult because observation is predicated upon not only theoretical concepts but also the values and perspective of the researcher. Despite several previous attempts to raise awareness of more sophisticated approaches to measurement and the importance of quantitatively assessing reliability and validity, measurement issues continue to be widely ignored. The reasons are unknown, but differences in typical engineering and computer science graduate training programs (e.g., compared to psychology and management) are likely involved. This chapter therefore reviews key concepts in the science of measurement and applies them to software engineering research. A series of exercises for applying important measurement concepts to the reader’s research is included, and a sample dataset for the reader to try some of the statistical procedures mentioned is provided.


Notes

  1. Qualitative researchers can dismiss construct validity concerns by embracing interpretivism, but claiming to be an interpretivist means you’re not doing predominantly quantitative research.

  2. A mathematically simple way to get the shared variance is to sum the indicators, but more sophisticated approaches, such as confirmatory factor analysis, are typically used.

  3. Even when correcting \(\alpha \) for multiple comparisons, the more success dimensions we evaluate, the more likely we are to find one on which the new tool excels.

  4. “Operationalize” is a common term in the literature on construct validity. It refers to how we measure the construct, including our instruments and statistical approach.

  5. We have not used PyTorrent, but it looks promising.

  6. https://github.com/apache/maven

  7. https://www.designite-tools.com/

  8. http://www.virtualmachinery.com/jhdownload.htm

  9. https://scitools.com/

  10. https://personality-project.org/r/psych/

  11. https://github.com/apache/maven

References

  1. Agbo, A.A.: Cronbach’s alpha: review of limitations and associated recommendations. J. Psychol. Africa 20(2), 233–239 (2010)


  2. Archer, M., Bhaskar, R., Collier, A., Lawson, T., Norrie, A.: Critical Realism: Essential Readings. Routledge, London (2013)


  3. Bahrami, M., Shrikanth, N.C., Ruangwan, S., Liu, L., Mizobuchi, Y., Fukuyori, M., Chen, W.P., Munakata, K., Menzies, T.: PyTorrent: A Python library corpus for large-scale language models (2021). arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2110.01710

  4. Baltes, S., Ralph, P.: Sampling in software engineering research: a critical review and guidelines. Empir. Softw. Eng. 27(4), 94 (2022)


  5. Basilevsky, A.: Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley Series in Probability and Statistics. Wiley, London (1994)


  6. Briggs, D.: Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies. Routledge, London (2021)


  7. Campbell, N.: Physics: The Elements. Cambridge University Press, Cambridge (2013)


  8. Cattell, R.B.: The scree test for the number of factors. Multivariate Behav. Res. 1(2), 245–276 (1966)


  9. Cerri, L.Q., Justo, M.C., Clemente, V., Gomes, A.A., Pereira, A.S., Marques, D.R.: Insomnia severity index: a reliability generalisation meta-analysis. J. Sleep Res. 32(4), e13835 (2023)


  10. Coltman, T., Devinney, T.M., Midgley, D.F., Venaik, S.: Formative versus reflective measurement models: two applications of formative measurement. J. Bus. Res. 61(12), 1250–1262 (2008)


  11. Costello, A., Osborne, J.: Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract. Assessment Res. Eval. 10, 1–9 (2005)


  12. Drost, E.A.: Validity and reliability in social science research. Educ. Res. Perspect. 38(1), 105–123 (2011)


  13. Fassott, G., Henseler, J.: Formative (measurement). In: Wiley Encyclopedia of Management. John Wiley and Sons, London (2015). https://doi.org/10.1002/9781118785317.weom090113

  14. Field, A.: Discovering Statistics Using IBM SPSS Statistics, 5th edn. Sage (2017)


  15. Flater, D.W., Black, P.E., Fong, E.N., Kacker, R.N., Okun, V., Wood, S.S., Kuhn, D.R.: A rational foundation for software metrology. Tech. Rep. IR 8101, National Institute of Standards and Technology (2016). https://doi.org/10.6028/NIST.IR.8101

  16. Graziotin, D., Lenberg, P., Feldt, R., Wagner, S.: Psychometrics in behavioral software engineering: a methodological introduction with guidelines. ACM Trans. Softw. Eng. Methodol. 31(1), 1–36 (2021)


  17. Gwet, K.L.: Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC (2014)


  18. Hair, J., Black, W., Babin, B., Anderson, R.: Multivariate Data Analysis. Always Learning. Pearson Education Limited (2013)


  19. Hair, J.F., Risher, J.J., Sarstedt, M., Ringle, C.M.: When to use and how to report the results of PLS-SEM. Eur. Bus. Rev. 31(1), 2–24 (2019)


  20. Harrington, D.: Confirmatory Factor Analysis. Oxford University Press, Oxford (2009)


  21. Heilmann, C.: A new interpretation of the representational theory of measurement. Philos. Sci. 82, 787–797 (2015)


  22. Henseler, J.: Composite-Based Structural Equation Modeling. The Guilford Press (2021)


  23. Herzig, K., Just, S., Zeller, A.: It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In: 2013 35th International Conference on Software Engineering (ICSE), pp. 392–401 (2013). https://doi.org/10.1109/ICSE.2013.6606585

  24. Horn, J.: A rationale and test for the number of factors in factor analysis. Psychometrika 30, 179–185 (1965). https://doi.org/10.1007/BF02289447


  25. Howard, M.: A review of exploratory factor analysis (EFA) decisions and overview of current practices: What we are doing and how can we improve? Int. J. Hum.-Comput. Interact. 32 (2015). https://doi.org/10.1080/10447318.2015.1087664

  26. Hume, D.: A Treatise of Human Nature. Oxford University Press, Oxford (1896)


  27. ISO/IEC/IEEE International Standard – Systems and Software Engineering – Vocabulary. Standard, IEEE, Switzerland (2017). https://doi.org/10.1109/IEEESTD.2017.8016712

  28. Johnson, P., Ekstedt, M., Jacobson, I.: Where’s the theory for software engineering? IEEE Softw. 29(5), 96–96 (2012)


  29. Johnston, R.B., Smith, S.P.: How critical realism clarifies validity issues in theory-testing research: analysis and case. In: Information Systems Foundations: The Role of Design Science, pp. 21–48. ANU Press (2010)


  30. Kaiser, H.F.: The application of electronic computers to factor analysis. Educ. Psychol. Measur. 20(1), 141–151 (1960). https://doi.org/10.1177/001316446002000116


  31. Kaiser, H.F., Rice, J.: Little jiffy, mark IV. Educ. Psychol. Measur. 34(1), 111–117 (1974). https://doi.org/10.1177/001316447403400115


  32. Kimberlin, C.L., Winterstein, A.G.: Validity and reliability of measurement instruments used in research. Am. J. Health-Syst. Pharmacy 65(23), 2276–2284 (2008)


  33. Krippendorff, K.: Content Analysis: An Introduction to its Methodology. Sage (2018)


  34. Kuhn, T.S.: The function of measurement in modern physical science. Isis 52(2), 161–193 (1961)


  35. Liu, Y., Schuberth, F., Liu, Y., Henseler, J.: Modeling and assessing forged concepts in tourism and hospitality using confirmatory composite analysis. J. Bus. Res. 152, 221–230 (2022). https://doi.org/10.1016/j.jbusres.2022.07.040


  36. Macdonald, S., MacIntyre, P.: The generic job satisfaction scale: Scale development and its correlates. Empl. Assist. Q. 13(2), 1–16 (1997)


  37. Martín-Escudero, P., Cabanas, A.M., Dotor-Castilla, M.L., Galindo-Canales, M., Miguel-Tobal, F., Fernández-Pérez, C., Fuentes-Ferrer, M., Giannetti, R.: Are activity wrist-worn devices accurate for determining heart rate during intense exercise? Bioengineering 10(2), 254 (2023)


  38. McGuire, S., Schultz, E., Ayoola, B., Ralph, P.: Sustainability is stratified: toward a better theory of sustainable software engineering. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1996–2008 (2023). https://doi.org/10.1109/ICSE48619.2023.00169

  39. Michell, J.: Measurement in Psychology: A Critical History of a Methodological Concept, vol. 53. Cambridge University Press, Cambridge (1999)


  40. Mohanani, R., Ralph, P., Turhan, B., Mandić, V.: How templated requirements specifications inhibit creativity in software engineering. IEEE Trans. Softw. Eng. 48(10), 4074–4086 (2022). https://doi.org/10.1109/TSE.2021.3112503


  41. Mohanani, R., Turhan, B., Ralph, P.: Requirements framing affects design creativity. IEEE Trans. Softw. Eng. 47(5), 936–947 (2021). https://doi.org/10.1109/TSE.2019.2909033


  42. Parry, O., Kapfhammer, G.M., Hilton, M., McMinn, P.: A survey of flaky tests. ACM Trans. Softw. Eng. Methodol. 31(1) (2021). https://doi.org/10.1145/3476105

  43. Passmore, J.: Logical positivism. In: Edwards, P. (ed.) The Encyclopedia of Philosophy, vol. 5, pp. 52–57. Macmillan, New York (1967)


  44. Petter, S., Straub, D., Rai, A.: Specifying formative constructs in information systems research. MIS Quarterly 623–656 (2007)


  45. Putnick, D.L., Bornstein, M.H.: Measurement invariance conventions and reporting: the state of the art and future directions for psychological research. Dev. Rev. 41, 71–90 (2016)


  46. Ralph, P., bin Ali, N., Baltes, S., Bianculli, D., Diaz, J., Dittrich, Y., Ernst, N., Felderer, M., Feldt, R., Filieri, A., de França, B.B.N., Furia, C.A., Gay, G., Gold, N., Graziotin, D., He, P., Hoda, R., Juristo, N., Kitchenham, B., Lenarduzzi, V., Martínez, J., Melegati, J., Mendez, D., Menzies, T., Molleri, J., Pfahl, D., Robbes, R., Russo, D., Saarimäki, N., Sarro, F., Taibi, D., Siegmund, J., Spinellis, D., Staron, M., Stol, K., Storey, M.A., Taibi, D., Tamburri, D., Torchiano, M., Treude, C., Turhan, B., Wang, X., Vegas, S.: Empirical standards for software engineering research (2021). arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2010.03525

  47. Ralph, P., Baltes, S., Adisaputri, G., Torkar, R., Kovalenko, V., Kalinowski, M., Novielli, N., Yoo, S., Devroey, X., Tan, X., et al.: Pandemic programming: How COVID-19 affects software developers and how their organizations can help. Empir. Softw. Eng. 25, 4927–4961 (2020)


  48. Ralph, P., Kelly, P.: The dimensions of software engineering success. In: Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pp. 24–35. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2568225.2568261

  49. Ralph, P., Tempero, E.: Construct validity in software engineering research and software metrics. In: Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018, pp. 13–23 (2018)


  50. Russo, D., Stol, K.J.: PLS-SEM for software engineering research: an introduction and survey. ACM Comput. Surv. 54(4), 1–38 (2021)


  51. Samuels, P.C.: Advice on exploratory factor analysis. Tech. rep., Birmingham City University (2017). https://api.semanticscholar.org/CorpusID:201395127


  52. Santos, R.D.S., Ralph, P., Arshad, A., Stol, K.J.: Distributed scrum: a case meta-analysis. ACM Comput. Surv. 56(4) (2023). https://doi.org/10.1145/3626519

  53. Sayer, A.: Method in Social Science, Revised 2nd edn. Routledge, London (2010)


  54. Scott, H., Havercamp, S.M.: Measurement error. In: Volkmar, F.R. (ed.) Encyclopedia of Autism Spectrum Disorders, pp. 1817–1818. Springer, New York (2013)


  55. Sjøberg, D.I., Bergersen, G.R.: Construct validity in software engineering. IEEE Trans. Softw. Eng. 49(3), 1374–1396 (2022)


  56. Stol, K.J., Fitzgerald, B.: Theory-oriented software engineering. Sci. Comput. Program. 101, 79–98 (2015)


  57. Tal, E.: Old and new problems in philosophy of measurement. Philo. Comp. 8(12), 1159–1173 (2013)


  58. Tal, E.: Measurement in Science. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Fall 2020 edn. Metaphysics Research Lab, Stanford University (2020). https://plato.stanford.edu/archives/fall2020/entries/measurement-science/


  59. Tavakol, M., Dennick, R.: Making sense of Cronbach’s alpha. Int. J. Med. Educ. 2, 53 (2011)


  60. Tempero, E., Anslow, C., Dietrich, J., Han, T., Li, J., Lumpe, M., Melton, H., Noble, J.: The Qualitas corpus: a curated collection of Java code for empirical studies. In: 2010 Asia Pacific Software Engineering Conference, pp. 336–345. IEEE, Piscataway (2010)


  61. Tempero, E., Ralph, P.: A framework for defining coupling metrics. Sci. Comput. Program. 166, 214–230 (2018). https://doi.org/10.1016/j.scico.2018.02.004


  62. Trochim, W., Donnelly, J.P., Arora, K.: Research Methods: The Essential Knowledge Base, 2nd edn. Cengage Learning, Boston (2016)


  63. Velicer, W.F., Eaton, C.A., Fava, J.L.: Construct explication through factor or component analysis: a review and evaluation of alternative procedures for determining the number of factors or components. In: Goffin, R.D., Helmes, E. (eds.) Problems and Solutions in Human Assessment, pp. 41–71. Springer, Boston (2000)


  64. Ward, Z.B.: On Value-Laden Science. Studies in History and Philosophy of Science Part A, vol. 85, pp. 54–62 (2021)


  65. Zinbarg, R.E., Revelle, W., Yovel, I., Li, W.: Cronbach’s \(\alpha \), Revelle’s \(\beta \), and McDonald’s \(\omega _{H}\): their relations with each other and two alternative conceptualizations of reliability. Psychometrika 70, 123–133 (2005)



Acknowledgements

This work was supported by NSERC Discovery Grant RGPIN-2020-05001, Discovery Accelerator Supplement RGPAS-2020-00081, and the Izaak Walton Killam Postdoctoral fellowship program. Thanks are due to Klaas Stol and two anonymous reviewers for their constructive feedback on this chapter.

Supplementary Materials

Supplementary materials for the reliability analysis and exploratory factor analysis—including a dataset, sample scripts, sample results, and definitions of code quality metrics—can be found at https://doi.org/10.5281/zenodo.11544897.

Competing Interests: The authors have no conflicts of interest to declare that are relevant to the content of this chapter.

Author information

Correspondence to Paul Ralph.


Appendices

Appendix 1: Reliability Analysis

This appendix provides an example of analyzing the reliability of software metrics computed by different tools. The metrics were calculated from the source code of Apache Maven.Footnote 6 You can find the data and scripts in the online supplement to this book (see Supplementary Materials).

The purpose of this analysis is to investigate the extent to which metrics calculated by different tools provide consistent measurements.

1.1 Data Preparation

The dataset includes various continuous metrics, such as size, cohesion, inheritance, and coupling metrics. The following steps were undertaken to prepare the data for reliability analysis:

  1. Read the data from the Excel file containing the metrics.

        library(readxl)
        data <- read_excel("efaReadyMC.xlsx")

  2. Select the relevant metrics for analysis. Here we use LOC metrics computed by three different tools—Designite,Footnote 7 JHawk,Footnote 8 and Understand.Footnote 9 The corresponding columns in the dataset are Size.LOC.Designite, Size.LOC.JHawk, and Size.LOC.Understand.

        library(dplyr)
        rel1_data <- select(data, 'Size.LOC.Designite',
                            'Size.LOC.JHawk', 'Size.LOC.Understand')

1.2 Calculate a Measure of Reliability

Since lines of code is ratio-level data, we’ll use a measure of reliability rather than agreement (see Sect. 3.3.1). For this example, we will use Cronbach’s alpha because it is simpler to calculate and interpret. Note, however, that if we were looking at reliability after doing factor analysis or a similar technique, more accurate measures of reliability such as McDonald’s omega and composite reliability are available.

We will calculate Cronbach’s alpha using the psych packageFootnote 10 in R as follows:

  1. Convert the selected data into a data frame, with one column per tool (the "raters" whose consistency we are assessing).

        rel1_data <- as.data.frame(rel1_data)

  2. Calculate alpha.

        library(psych)
        alphaResult <- alpha(rel1_data)

1.3 Results and Interpretation

Cronbach’s alpha has a maximum value of 1, with higher values indicating greater reliability and internal consistency among the measured items (values near or below zero indicate serious problems). Generally, \(\alpha >0.7\) is considered acceptable, while \(\alpha >0.9\) is considered excellent. Our result \(\alpha =0.97\) indicates excellent reliability. This suggests that the three tested tools are measuring basically, if not exactly, the same thing. However, excellent reliability doesn’t mean that the studied metrics reflect the target underlying construct (e.g., class size). To determine that, we need a different kind of analysis (next).
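The supplementary scripts contain the full analysis; as a quick, illustrative sketch (not taken from the chapter’s supplement), the alpha estimate can be inspected directly, and McDonald’s omega can be computed with the same psych package if a more sophisticated estimate is desired. The object names follow the steps above; the one-factor assumption in the omega call is ours.

        # Inspect the alpha estimate computed above
        alphaResult$total$raw_alpha        # approximately 0.97 for the three LOC metrics

        # McDonald's omega; with only three indicators we assume a single factor
        omegaResult <- omega(rel1_data, nfactors = 1)
        omegaResult$omega.tot              # omega total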

Appendix 2: Exploratory Factor Analysis

This appendix provides an example of an exploratory factor analysis, following established guidelines [18] and using selected metrics calculated from the source code of Apache Maven.Footnote 11 You can find the data and scripts in the online supplement to this book (see Supplementary Materials).

1.1 Objective of Factor Analysis

The objective of our exploratory factor analysis is to assess the convergent and discriminant validity of common, object-oriented, class-level software code quality metrics calculated on the source code of Apache Maven. Convergent validity refers to how similar a measure is to other measures to which it should theoretically be similar; discriminant validity refers to how different a measure is from other measures from which it should theoretically differ [62].

1.2 Design the Factor Analysis

The dataset we will use contains measurements of 22 metrics on approximately 1000 classes, well over the threshold of 10 observations per variable (a short R sketch for loading and checking the data follows this list). We aim to classify these metrics into six factors:

  1. Cohesion: the degree to which elements of a class belong together.

  2. In-coupling: the degree to which a class is used by other classes.

  3. Out-coupling: the degree to which a class depends on other classes.

  4. Size: how big the class is.

  5. Sub-inheritance: the degree to which a class has subclasses in an inheritance hierarchy.

  6. Sup-inheritance: the degree to which a class has superclasses in an inheritance hierarchy.
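As a sketch (assuming the efaReadyMC.xlsx file from Appendix 1 holds the 22 selected metrics as columns; in practice you may first need to select the relevant columns), loading the data and checking the observations-per-variable ratio might look like this:

        library(readxl)

        # Read the metrics; assumes the 22 metrics are the columns of this file
        efa_data <- as.data.frame(read_excel("efaReadyMC.xlsx"))

        dim(efa_data)                    # roughly 1000 rows (classes) by 22 columns (metrics)
        nrow(efa_data) / ncol(efa_data)  # should exceed the 10-observations-per-variable threshold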

1.3 Check Assumptions of Factor Analysis

The assumptions of a factor analysis, and how we justify or test them, are as follows:

  • Factor analysis should only be used when we theorize that a latent factor structure exists. In this case, we theorize that specific factors (size, coupling, etc.) are latent and do drive changes in metrics.

  • Homogeneous sample of measurements. In other words, all metrics are calculated on the same sample of classes.

  • Multicollinearity. If none of the variables are correlated, we cannot perform factor analysis; however, if two or more variables are perfectly or near perfectly correlated, it will cause a “nonpositive definite” matrix, which will prevent the factor analysis from completing. We can assess multicollinearity in three ways (an R sketch for these checks follows this list):

    1. Visually inspecting a correlation matrix. In this case, we can see many correlations \(>0.3\), which indicates that a factor analysis is possible [18].

    2. The Kaiser-Meyer-Olkin (KMO) test. The KMO test tells us how correlated the variables in a dataset are. A minimum KMO value of 0.5 is acceptable and a value above 0.7 is recommended for a good factor analysis [31]. Our \(KMO = 0.71\) is considered “middling” and appropriate for factor analysis [31]. The KMO values for each individual metric are also greater than the acceptable minimum of 0.5 [31] (Fig. 8).

      Fig. 8: Kaiser-Meyer-Olkin (KMO) test

    3. Bartlett’s test of sphericity. Bartlett’s test of sphericity analyzes the correlations between variables to see if they are large enough to perform a factor analysis [14]. Bartlett’s test is significant (\(p<0.001\)), so we can proceed with the factor analysis (Fig. 9).

      Fig. 9: Bartlett’s test of sphericity
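The following sketch reproduces these checks with the psych package, using the efa_data frame loaded above (an assumed name, not from the chapter’s scripts):

        library(psych)

        corMatrix <- cor(efa_data)                       # correlation matrix for visual inspection
        KMO(corMatrix)                                   # overall and per-variable KMO values
        cortest.bartlett(corMatrix, n = nrow(efa_data))  # Bartlett's test of sphericity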

1.4 Derive Factors and Assess Fit

Researchers disagree on the best method of determining the number of factors to extract. We recommend using several methods to inform the decision [11].

Parallel analysis estimates the number of factors by comparing the eigenvalues of the actual data with eigenvalues of random, uncorrelated data; we retain as many factors as there are actual eigenvalues exceeding their random-data counterparts [24]. In our data, 21 factors are retained (Fig. 10). Parallel analysis is known to overestimate the number of factors extracted from very large datasets like ours [11].

Fig. 10: Parallel analysis

Alternatively, we retain a number of factors equal to the number of eigenvalues greater than 1 (the Kaiser Criterion [30]). This method also loses effectiveness as the size of the dataset increases [63], though less severely. We found five eigenvalues greater than 1 (Fig. 11), suggesting that we retain five factors.

Fig. 11: Kaiser Criterion

We can also estimate the number of factors to retain by counting the eigenvalues before the bend in a scree plot [8]. This technique is a little tricky and requires expertise if the plot is complicated [11]. Figure 12 has multiple bends—at two, four, and seven eigenvalues—which suggests retaining anywhere between two and seven factors.

Fig. 12: Scree plot
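A sketch of these retention heuristics, again using the assumed efa_data frame and the psych package:

        library(psych)

        # Parallel analysis (also draws the scree plot)
        fa.parallel(efa_data, fa = "fa")

        # Kaiser Criterion: count the eigenvalues of the correlation matrix greater than 1
        eigenvalues <- eigen(cor(efa_data))$values
        sum(eigenvalues > 1)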

In the theory approach, we retain the number of factors that we theorize exist—in this case, six: size, cohesion, sub-inheritance, sup-inheritance, in-coupling, and out-coupling.

From the above discussion, we have the following findings:

  1. Parallel analysis suggests 21 factors.

  2. The Kaiser Criterion suggests five factors.

  3. The scree plot suggests between two and seven factors.

  4. Theory suggests six factors.

Based on this, we tentatively retain seven factors as shown in Table 1. Table 1 shows the variables and their corresponding factor loadings on each of the seven factors. Factor loading refers to the correlation between a variable and a factor. A high loading indicates that the factor explains enough of the variable’s variance for the two to have a considerable relationship. Small loadings (\(<0.3\)) are considered insignificant [11, 18, 25, 51] and are thus suppressed in our model. For example, Cohesion.LCOM, Cohesion.LCOMModified, Cohesion.YALCOM, and Size.CountInstanceVariable load together on Factor 1, which means that these variables seem to be measuring the same factor.

Table 1 EFA with seven factors

These factors explain 84% of the variance in our dataset, which is good. If the variance explained was less than 60%, we might opt to include additional factors [18]. However, Factor 5 only has two metrics loading on it. The minimum is three, so either we need more metrics or fewer factors.

In this case, we can reduce the number of factors to six, as theorized. Table 2 shows the six-factor solution. This solution explains 78% of the variance and each factor has at least three metrics, so we can move on to iteratively refining and interpreting the factors.

Table 2 EFA with six factors (step 1)

(We included this step to illustrate the realistic complexity of choosing an appropriate number of factors. Sometimes you’re well into the analysis before you figure out how many factors you should have.)

1.5 Interpret and Refine Factors

Rotating the factors helps us interpret them. In a rotated factor solution, the axes are rotated so that variables that load together are plotted closer together, causing them to load highly on a single factor.

Factors can be rotated using orthogonal or oblique rotation. An oblique rotation is preferable when factors are assumed to be correlated; an orthogonal rotation is used otherwise [11, 14, 18]. Despite the popularity of orthogonal rotation (specifically varimax), oblique rotation is usually more appropriate because factors are usually correlated. Use orthogonal rotation only if you have a very good reason to believe that the factors are uncorrelated. We use oblimin rotation (a type of oblique rotation) to rotate the axes in our model.
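A sketch of obtaining the rotated six-factor solution with psych (again using the assumed efa_data name; the factor extraction method is left at the package default):

        library(psych)
        library(GPArotation)   # required for oblimin rotation

        efa6 <- fa(efa_data, nfactors = 6, rotate = "oblimin")

        # Suppress small loadings (< 0.3) when inspecting the solution
        print(efa6$loadings, cutoff = 0.3)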

Now we inspect the solution for problems, and remove problems one at a time, beginning with the worst. There is no algorithm for this. “Worst” is subjective. We can only give examples of problems and describe their severity. We are looking for three basic kinds of problems:

  1. Low communality: Communality (“h2” in Tables 1 and 2) is the amount of variance in a variable that can be explained by the factor solution. Low communality (h2 \(<\) 0.5) indicates that less than half of the variable’s variance is accounted for, implying that the variable is not closely related to any of the factors and adds unwanted complexity without sufficient explanation [18].

  2. Cross-loadings: Variables with high loadings on multiple factors.

  3. Loading on the wrong factor: Variables loading highly (loadings \(>\) 0.5) on a factor they should not theoretically relate to.

Looking at Table 2, Cohesion.LCOM5 has the lowest communality (\(h2=0.16\)) and loads on the wrong factor (in-coupling), so we remove it first and rerun the EFA (Table 3). Now Size.CountDeclMethodDefault has the lowest communality (h2 = 0.25) and loads on the wrong factor (in-coupling again). So we remove it and run the EFA again (Table 4). The next variable with the lowest communality is Cohesion.YALCOM (h2 = 0.45); however, it loads well on the correct factor, so we’ll retain it for now and move on to cross-loadings. Size.CountInstanceVariable loads higher on the wrong factor (cohesion) than on the correct factor (size). Thus, we remove it. Rerunning the EFA (Table 5), we find that In-Coupling.CBOin also has a cross-loading. However, it loads much higher on the correct factor (in-coupling) than the incorrect factor (out-coupling). Furthermore, the incorrect loading is smaller than the smallest correct loading in the EFA, so we retain In-Coupling.CBOin for now.

Table 3 EFA with six factors (step 2)
Table 4 EFA with six factors (step 3)
Table 5 EFA with six factors (step 4)
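As an illustration of one such refinement step (a sketch, not the supplement’s script), dropping a problematic metric and rerunning the EFA might look like this:

        library(dplyr)

        # Remove the metric with low communality that loads on the wrong factor
        efa_data <- select(efa_data, -Cohesion.LCOM5)

        # Rerun the six-factor EFA and inspect the refined solution
        efa6 <- fa(efa_data, nfactors = 6, rotate = "oblimin")
        print(efa6$loadings, cutoff = 0.3)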

Since there are no more low communalities, cross-loadings, or incorrect loadings, and the solution explains 87% of the total variance in the dataset using six factors, each of which has at least three metrics, our EFA is now complete (Table 6).

Table 6 Final EFA


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Ralph, P., Kuutila, M., Arif, H., Ayoola, B. (2024). Teaching Software Metrology: The Science of Measurement for Software Engineering. In: Mendez, D., Avgeriou, P., Kalinowski, M., Ali, N.B. (eds) Handbook on Teaching Empirical Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-71769-7_5


  • DOI: https://doi.org/10.1007/978-3-031-71769-7_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71768-0

  • Online ISBN: 978-3-031-71769-7

