
Teaching Software Metrology: The Science of Measurement for Software Engineering

Handbook on Teaching Empirical Software Engineering

Abstract

While the methodological rigor of computing research has improved considerably in the past two decades, quantitative software engineering research is hampered by immature measures and inattention to theory. Measurement—the principled assignment of numbers to phenomena—is intrinsically difficult because observation is predicated upon not only theoretical concepts but also the values and perspective of the researcher. Despite several previous attempts to raise awareness of more sophisticated approaches to measurement and the importance of quantitatively assessing reliability and validity, measurement issues continue to be widely ignored. The reasons are unknown, but differences in typical engineering and computer science graduate training programs (e.g., compared to psychology and management) are likely involved. This chapter therefore reviews key concepts in the science of measurement and applies them to software engineering research. A series of exercises for applying important measurement concepts to the reader’s research is included, and a sample dataset for the reader to try some of the statistical procedures mentioned is provided.


Notes

  1. Qualitative researchers can dismiss construct validity concerns by embracing interpretivism, but claiming to be an interpretivist means you’re not doing predominantly quantitative research.

  2. A mathematically simple way to get the shared variance is to sum the indicators, but more sophisticated approaches, such as confirmatory factor analysis, are typically used.

  3. Even when correcting \(\alpha \) for multiple comparisons, the more success dimensions we evaluate, the more likely we are to find one on which the new tool excels.

  4. “Operationalize” is a common term in the literature on construct validity. It refers to how we measure the construct, including our instruments and statistical approach.

  5. We have not used PyTorrent, but it looks promising.

  6. https://github.com/apache/maven

  7. https://www.designite-tools.com/

  8. http://www.virtualmachinery.com/jhdownload.htm

  9. https://scitools.com/

  10. https://personality-project.org/r/psych/

  11. https://github.com/apache/maven

References

  1. Agbo, A.A.: Cronbach’s alpha: review of limitations and associated recommendations. J. Psychol. Africa 20(2), 233–239 (2010)


  2. Archer, M., Bhaskar, R., Collier, A., Lawson, T., Norrie, A.: Critical Realism: Essential Readings. Routledge, London (2013)


  3. Bahrami, M., Shrikanth, N.C., Ruangwan, S., Liu, L., Mizobuchi, Y., Fukuyori, M., Chen, W.P., Munakata, K., Menzies, T.: PyTorrent: A Python library corpus for large-scale language models (2021). arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2110.01710

  4. Baltes, S., Ralph, P.: Sampling in software engineering research: a critical review and guidelines. Empir. Softw. Eng. 27(4), 94 (2022)


  5. Basilevsky, A.: Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley Series in Probability and Statistics. Wiley, London (1994)


  6. Briggs, D.: Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies. Routledge, London (2021)


  7. Campbell, N.: Physics: The Elements. Cambridge University Press, Cambridge (2013)


  8. Cattell, R.B.: The scree test for the number of factors. Multivariate Behav. Res. 1(2), 245–276 (1966)


  9. Cerri, L.Q., Justo, M.C., Clemente, V., Gomes, A.A., Pereira, A.S., Marques, D.R.: Insomnia severity index: a reliability generalisation meta-analysis. J. Sleep Res. 32(4), e13835 (2023)


  10. Coltman, T., Devinney, T.M., Midgley, D.F., Venaik, S.: Formative versus reflective measurement models: two applications of formative measurement. J. Bus. Res. 61(12), 1250–1262 (2008)


  11. Costello, A., Osborne, J.: Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract. Assessment Res. Eval. 10, 1–9 (2005)


  12. Drost, E.A.: Validity and reliability in social science research. Educ. Res. Perspect. 38(1), 105–123 (2011)


  13. Fassott, G., Henseler, J.: Formative (measurement). In: Wiley Encyclopedia of Management. John Wiley and Sons, London (2015). https://doi.org/10.1002/9781118785317.weom090113

  14. Field, A.: Discovering Statistics Using IBM SPSS Statistics, 5th edn. Sage (2017)


  15. Flater, D.W., Black, P.E., Fong, E.N., Kacker, R.N., Okun, V., Wood, S.S., Kuhn, D.R.: A rational foundation for software metrology. Tech. Rep. IR 8101, National Institute of Standards and Technology (2016). https://doi.org/10.6028/NIST.IR.8101

  16. Graziotin, D., Lenberg, P., Feldt, R., Wagner, S.: Psychometrics in behavioral software engineering: a methodological introduction with guidelines. ACM Trans. Softw. Eng. Methodol. 31(1), 1–36 (2021)


  17. Gwet, K.L.: Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC (2014)


  18. Hair, J., Black, W., Babin, B., Anderson, R.: Multivariate Data Analysis. Always Learning. Pearson Education Limited (2013)


  19. Hair, J.F., Risher, J.J., Sarstedt, M., Ringle, C.M.: When to use and how to report the results of PLS-SEM. Eur. Bus. Rev. 31(1), 2–24 (2019)


  20. Harrington, D.: Confirmatory Factor Analysis. Oxford University Press, Oxford (2009)


  21. Heilmann, C.: A new interpretation of the representational theory of measurement. Philos. Sci. 82, 787–797 (2015)


  22. Henseler, J.: Composite-Based Structural Equation Modeling. The Guilford Press (2021)


  23. Herzig, K., Just, S., Zeller, A.: It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In: 2013 35th International Conference on Software Engineering (ICSE), pp. 392–401 (2013). https://doi.org/10.1109/ICSE.2013.6606585

  24. Horn, J.: A rationale and test for the number of factors in factor analysis. Psychometrika 30, 179–185 (1965). https://doi.org/10.1007/BF02289447


  25. Howard, M.: A review of exploratory factor analysis (EFA) decisions and overview of current practices: What we are doing and how can we improve? Int. J. Hum.-Comput. Interact. 32 (2015). https://doi.org/10.1080/10447318.2015.1087664

  26. Hume, D.: A Treatise of Human Nature. Oxford University Press, Oxford (1896)


  27. ISO/IEC/IEEE International Standard – Systems and Software Engineering – Vocabulary. Standard, IEEE, Switzerland (2017). https://doi.org/10.1109/IEEESTD.2017.8016712

  28. Johnson, P., Ekstedt, M., Jacobson, I.: Where’s the theory for software engineering? IEEE Softw. 29(5), 96–96 (2012)


  29. Johnston, R.B., Smith, S.P.: How critical realism clarifies validity issues in theory-testing research: analysis and case. In: Information Systems Foundations: The Role of Design Science, pp. 21–48. ANU Press (2010)


  30. Kaiser, H.F.: The application of electronic computers to factor analysis. Educ. Psychol. Measur. 20(1), 141–151 (1960). https://doi.org/10.1177/001316446002000116


  31. Kaiser, H.F., Rice, J.: Little jiffy, mark IV. Educ. Psychol. Measur. 34(1), 111–117 (1974). https://doi.org/10.1177/001316447403400115


  32. Kimberlin, C.L., Winterstein, A.G.: Validity and reliability of measurement instruments used in research. Am. J. Health-Syst. Pharmacy 65(23), 2276–2284 (2008)


  33. Krippendorff, K.: Content Analysis: An Introduction to its Methodology. Sage (2018)


  34. Kuhn, T.S.: The function of measurement in modern physical science. Isis 52(2), 161–193 (1961)


  35. Liu, Y., Schuberth, F., Liu, Y., Henseler, J.: Modeling and assessing forged concepts in tourism and hospitality using confirmatory composite analysis. J. Bus. Res. 152, 221–230 (2022). https://doi.org/10.1016/j.jbusres.2022.07.040


  36. Macdonald, S., MacIntyre, P.: The generic job satisfaction scale: Scale development and its correlates. Empl. Assist. Q. 13(2), 1–16 (1997)


  37. Martín-Escudero, P., Cabanas, A.M., Dotor-Castilla, M.L., Galindo-Canales, M., Miguel-Tobal, F., Fernández-Pérez, C., Fuentes-Ferrer, M., Giannetti, R.: Are activity wrist-worn devices accurate for determining heart rate during intense exercise? Bioengineering 10(2), 254 (2023)


  38. McGuire, S., Schultz, E., Ayoola, B., Ralph, P.: Sustainability is stratified: toward a better theory of sustainable software engineering. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1996–2008 (2023). https://doi.org/10.1109/ICSE48619.2023.00169

  39. Michell, J.: Measurement in Psychology: A Critical History of a Methodological Concept, vol. 53. Cambridge University Press, Cambridge (1999)


  40. Mohanani, R., Ralph, P., Turhan, B., Mandić, V.: How templated requirements specifications inhibit creativity in software engineering. IEEE Trans. Softw. Eng. 48(10), 4074–4086 (2022). https://doi.org/10.1109/TSE.2021.3112503


  41. Mohanani, R., Turhan, B., Ralph, P.: Requirements framing affects design creativity. IEEE Trans. Softw. Eng. 47(5), 936–947 (2021). https://doi.org/10.1109/TSE.2019.2909033


  42. Parry, O., Kapfhammer, G.M., Hilton, M., McMinn, P.: A survey of flaky tests. ACM Trans. Softw. Eng. Methodol. 31(1) (2021). https://doi.org/10.1145/3476105

  43. Passmore, J.: Logical positivism. In: Edwards, P. (ed.) The Encyclopedia of Philosophy, vol. 5, pp. 52–57. Macmillan, New York (1967)


  44. Petter, S., Straub, D., Rai, A.: Specifying formative constructs in information systems research. MIS Quarterly 623–656 (2007)


  45. Putnick, D.L., Bornstein, M.H.: Measurement invariance conventions and reporting: the state of the art and future directions for psychological research. Dev. Rev. 41, 71–90 (2016)


  46. Ralph, P., bin Ali, N., Baltes, S., Bianculli, D., Diaz, J., Dittrich, Y., Ernst, N., Felderer, M., Feldt, R., Filieri, A., de França, B.B.N., Furia, C.A., Gay, G., Gold, N., Graziotin, D., He, P., Hoda, R., Juristo, N., Kitchenham, B., Lenarduzzi, V., Martínez, J., Melegati, J., Mendez, D., Menzies, T., Molleri, J., Pfahl, D., Robbes, R., Russo, D., Saarimäki, N., Sarro, F., Taibi, D., Siegmund, J., Spinellis, D., Staron, M., Stol, K., Storey, M.A., Taibi, D., Tamburri, D., Torchiano, M., Treude, C., Turhan, B., Wang, X., Vegas, S.: Empirical standards for software engineering research (2021). arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2010.03525

  47. Ralph, P., Baltes, S., Adisaputri, G., Torkar, R., Kovalenko, V., Kalinowski, M., Novielli, N., Yoo, S., Devroey, X., Tan, X., et al.: Pandemic programming: How COVID-19 affects software developers and how their organizations can help. Empir. Softw. Eng. 25, 4927–4961 (2020)


  48. Ralph, P., Kelly, P.: The dimensions of software engineering success. In: Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pp. 24–35. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2568225.2568261

  49. Ralph, P., Tempero, E.: Construct validity in software engineering research and software metrics. In: Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018, pp. 13–23 (2018)


  50. Russo, D., Stol, K.J.: PLS-SEM for software engineering research: an introduction and survey. ACM Comput. Surv. 54(4), 1–38 (2021)


  51. Samuels, P.C.: Advice on exploratory factor analysis. Tech. rep., Birmingham City University (2017). https://api.semanticscholar.org/CorpusID:201395127


  52. Santos, R.D.S., Ralph, P., Arshad, A., Stol, K.J.: Distributed scrum: a case meta-analysis. ACM Comput. Surv. 56(4) (2023). https://doi.org/10.1145/3626519

  53. Sayer, A.: Method in Social Science, Revised 2nd edn. Routledge, London (2010)


  54. Scott, H., Havercamp, S.M.: Measurement error. In: Volkmar, F.R. (ed.) Encyclopedia of Autism Spectrum Disorders, pp. 1817–1818. Springer, New York (2013)


  55. Sjøberg, D.I., Bergersen, G.R.: Construct validity in software engineering. IEEE Trans. Softw. Eng. 49(3), 1374–1396 (2022)


  56. Stol, K.J., Fitzgerald, B.: Theory-oriented software engineering. Sci. Comput. Program. 101, 79–98 (2015)


  57. Tal, E.: Old and new problems in philosophy of measurement. Philo. Comp. 8(12), 1159–1173 (2013)


  58. Tal, E.: Measurement in Science. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Fall 2020 edn. Metaphysics Research Lab, Stanford University (2020). https://plato.stanford.edu/archives/fall2020/entries/measurement-science/


  59. Tavakol, M., Dennick, R.: Making sense of Cronbach’s alpha. Int. J. Med. Educ. 2, 53 (2011)


  60. Tempero, E., Anslow, C., Dietrich, J., Han, T., Li, J., Lumpe, M., Melton, H., Noble, J.: The Qualitas corpus: a curated collection of Java code for empirical studies. In: 2010 Asia Pacific Software Engineering Conference, pp. 336–345. IEEE, Piscataway (2010)


  61. Tempero, E., Ralph, P.: A framework for defining coupling metrics. Sci. Comput. Program. 166, 214–230 (2018). https://doi.org/10.1016/j.scico.2018.02.004


  62. Trochim, W., Donnelly, J.P., Arora, K.: Research Methods: The Essential Knowledge Base, 2nd edn. Cengage Learning, Boston (2016)


  63. Velicer, W.F., Eaton, C.A., Fava, J.L.: Construct explication through factor or component analysis: a review and evaluation of alternative procedures for determining the number of factors or components. In: Goffin, R.D., Helmes, E. (eds.) Problems and Solutions in Human Assessment, pp. 41–71. Springer, Boston (2000)


  64. Ward, Z.B.: On Value-Laden Science. Studies in History and Philosophy of Science Part A, vol. 85, pp. 54–62 (2021)


  65. Zinbarg, R.E., Revelle, W., Yovel, I., Li, W.: Cronbach’s \(\alpha \), Revelle’s \(\beta \), and McDonald’s \(\omega _{H}\): their relations with each other and two alternative conceptualizations of reliability. Psychometrika 70, 123–133 (2005)



Acknowledgements

This work was supported by NSERC Discovery Grant RGPIN-2020-05001, Discovery Accelerator Supplement RGPAS-2020-00081, and the Izaak Walton Killam Postdoctoral fellowship program. Thanks are due to Klaas Stol and two anonymous reviewers for their constructive feedback on this chapter.

Supplementary Materials

Supplementary materials for the reliability analysis and exploratory factor analysis—including a dataset, sample scripts, sample results, and definitions of code quality metrics—can be found at https://doi.org/10.5281/zenodo.11544897.

Competing Interests: The authors have no conflicts of interest to declare that are relevant to the content of this chapter.

Author information

Correspondence to Paul Ralph.


Appendices

Appendix 1: Reliability Analysis

This appendix provides an example of analyzing the reliability of software metrics computed by different tools. The metrics were calculated from the source code of Apache Maven.Footnote 6 You can find the data and scripts in the online supplement to this book (see Supplementary Materials).

The purpose of this analysis is to investigate the extent to which metrics calculated by different tools provide consistent measurements.

1.1 Data Preparation

The dataset includes various continuous metrics, such as size, cohesion, inheritance, and coupling metrics. The following steps were undertaken to prepare the data for reliability analysis:

  1. Read the data from the Excel file containing the metrics.

        library(readxl)
        data <- read_excel("efaReadyMC.xlsx")

  2. Select the relevant metrics for analysis. Here we use LOC metrics computed by three different tools—Designite,Footnote 7 JHawk,Footnote 8 and Understand.Footnote 9 The corresponding columns in the dataset are Size.LOC.Designite, Size.LOC.JHawk, and Size.LOC.Understand.

        library(dplyr)
        rel1_data <- select(data, 'Size.LOC.Designite',
                            'Size.LOC.JHawk', 'Size.LOC.Understand')

1.2 Calculate a Measure of Reliability

Since lines of code is ratio-level data, we’ll use a measure of reliability rather than agreement (see Sect. 3.3.1). For this example, we will use Cronbach’s alpha because it is simpler to calculate and interpret. Note, however, that if we were looking at reliability after doing factor analysis or a similar technique, more accurate measures of reliability such as McDonald’s omega and composite reliability are available.

We will calculate Cronbach’s alpha using the psych packageFootnote 10 in R as follows:

  1. Convert the selected data into a data frame, with one column per tool (the "raters" whose consistency we are assessing).

        rel1_data <- as.data.frame(rel1_data)

  2. Calculate alpha.

        library(psych)
        alphaResult <- alpha(rel1_data)

1.3 Results and Interpretation

Cronbach’s alpha has a maximum value of 1, with higher values indicating greater reliability and internal consistency among the measured items (values near or below zero indicate serious problems). Generally, \(\alpha >0.7\) is considered acceptable, while \(\alpha >0.9\) is considered excellent. Our result \(\alpha =0.97\) indicates excellent reliability. This suggests that the three tested tools are measuring basically, if not exactly, the same thing. However, excellent reliability doesn’t mean that the studied metrics reflect the target underlying construct (e.g., class size). To determine that, we need a different kind of analysis (next).
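The supplementary scripts contain the full analysis; as a quick, illustrative sketch (not taken from the chapter’s supplement), the alpha estimate can be inspected directly, and McDonald’s omega can be computed with the same psych package if a more sophisticated estimate is desired. The object names follow the steps above; the one-factor assumption in the omega call is ours.

        # Inspect the alpha estimate computed above
        alphaResult$total$raw_alpha        # approximately 0.97 for the three LOC metrics

        # McDonald's omega; with only three indicators we assume a single factor
        omegaResult <- omega(rel1_data, nfactors = 1)
        omegaResult$omega.tot              # omega total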

Appendix 2: Exploratory Factor Analysis

This appendix provides an example of an exploratory factor analysis, following established guidelines [18] and using selected metrics calculated from the source code of Apache Maven.Footnote 11 You can find the data and scripts in the online supplement to this book (see Supplementary Materials).

1.1 Objective of Factor Analysis

The objective of our exploratory factor analysis is to assess the convergent and discriminant validity of common, object-oriented, class-level software code quality metrics calculated on the source code of Apache Maven. Convergent validity refers to how similar a measure is to other measures to which it should theoretically be similar; discriminant validity refers to how different a measure is from other measures from which it should theoretically differ [62].

1.2 Design the Factor Analysis

The dataset we will use contains measurements of 22 metrics on approximately 1000 classes, well over the threshold of 10 observations per variable (a short R sketch for loading and checking the data follows this list). We aim to classify these metrics into six factors:

  1. Cohesion: the degree to which elements of a class belong together.

  2. In-coupling: the degree to which a class is used by other classes.

  3. Out-coupling: the degree to which a class depends on other classes.

  4. Size: how big the class is.

  5. Sub-inheritance: the degree to which a class has subclasses in an inheritance hierarchy.

  6. Sup-inheritance: the degree to which a class has superclasses in an inheritance hierarchy.
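As a sketch (assuming the efaReadyMC.xlsx file from Appendix 1 holds the 22 selected metrics as columns; in practice you may first need to select the relevant columns), loading the data and checking the observations-per-variable ratio might look like this:

        library(readxl)

        # Read the metrics; assumes the 22 metrics are the columns of this file
        efa_data <- as.data.frame(read_excel("efaReadyMC.xlsx"))

        dim(efa_data)                    # roughly 1000 rows (classes) by 22 columns (metrics)
        nrow(efa_data) / ncol(efa_data)  # should exceed the 10-observations-per-variable threshold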

1.3 Check Assumptions of Factor Analysis

The assumptions of a factor analysis, and how we justify or test them, are as follows:

  • Factor analysis should only be used when we theorize that a latent factor structure exists. In this case, we theorize that specific factors (size, coupling, etc.) are latent and do drive changes in metrics.

  • Homogeneous sample of measurements. In other words, all metrics are calculated on the same sample of classes.

  • Multicollinearity. If none of the variables are correlated, we cannot perform factor analysis; however, if two or more variables are perfectly or near perfectly correlated, it will cause a “nonpositive definite” matrix, which will prevent the factor analysis from completing. We can assess multicollinearity in three ways (an R sketch for these checks follows this list):

    1. Visually inspecting a correlation matrix. In this case, we can see many correlations \(>0.3\), which indicates that a factor analysis is possible [18].

    2. The Kaiser-Meyer-Olkin (KMO) test. The KMO test tells us how correlated the variables in a dataset are. A minimum KMO value of 0.5 is acceptable and a value above 0.7 is recommended for a good factor analysis [31]. Our \(KMO = 0.71\) is considered “middling” and appropriate for factor analysis [31]. The KMO values for each individual metric are also greater than the acceptable minimum of 0.5 [31] (Fig. 8).

      Fig. 8: Kaiser-Meyer-Olkin (KMO) test

    3. Bartlett’s test of sphericity. Bartlett’s test of sphericity analyzes the correlations between variables to see if they are large enough to perform a factor analysis [14]. Bartlett’s test is significant (\(p<0.001\)), so we can proceed with the factor analysis (Fig. 9).

      Fig. 9: Bartlett’s test of sphericity
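The following sketch reproduces these checks with the psych package, using the efa_data frame loaded above (an assumed name, not from the chapter’s scripts):

        library(psych)

        corMatrix <- cor(efa_data)                       # correlation matrix for visual inspection
        KMO(corMatrix)                                   # overall and per-variable KMO values
        cortest.bartlett(corMatrix, n = nrow(efa_data))  # Bartlett's test of sphericity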

1.4 Derive Factors and Assess Fit

Researchers disagree on the best method of determining the number of factors to extract. We recommend using several methods to inform the decision [11].

Parallel analysis estimates the number of factors by comparing the eigenvalues of the actual data with eigenvalues of random, uncorrelated data; we retain as many factors as there are actual eigenvalues exceeding their random-data counterparts [24]. In our data, 21 factors are retained (Fig. 10). Parallel analysis is known to overestimate the number of factors extracted from very large datasets like ours [11].

Fig. 10: Parallel analysis

Alternatively, we retain a number of factors equal to the number of eigenvalues greater than 1 (the Kaiser Criterion [30]). This method also loses effectiveness as the size of the dataset increases [63], though less severely. We found five eigenvalues greater than 1 (Fig. 11), suggesting that we retain five factors.

Fig. 11: Kaiser Criterion

We can also estimate the number of factors to retain by counting the eigenvalues before the bend in a scree plot [8]. This technique is a little tricky and requires expertise if the plot is complicated [11]. Figure 12 has multiple bends—at two, four, and seven eigenvalues—which suggests retaining anywhere between two and seven factors.

Fig. 12: Scree plot
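A sketch of these retention heuristics, again using the assumed efa_data frame and the psych package:

        library(psych)

        # Parallel analysis (also draws the scree plot)
        fa.parallel(efa_data, fa = "fa")

        # Kaiser Criterion: count the eigenvalues of the correlation matrix greater than 1
        eigenvalues <- eigen(cor(efa_data))$values
        sum(eigenvalues > 1)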

In the theory approach, we retain the number of factors that we theorize exist—in this case, six: size, cohesion, sub-inheritance, sup-inheritance, in-coupling, and out-coupling.

From the above discussion, we have the following findings:

  1. Parallel analysis suggests 21 factors.

  2. The Kaiser Criterion suggests five factors.

  3. The scree plot suggests between two and seven factors.

  4. Theory suggests six factors.

Based on this, we tentatively retain seven factors as shown in Table 1. Table 1 shows the variables and their corresponding factor loadings on each of the seven factors. Factor loading refers to the correlation between a variable and a factor. A high loading indicates that the factor explains enough of the variable’s variance for the two to have a considerable relationship. Small loadings (\(<0.3\)) are considered insignificant [11, 18, 25, 51] and are thus suppressed in our model. For example, Cohesion.LCOM, Cohesion.LCOMModified, Cohesion.YALCOM, and Size.CountInstanceVariable load together on Factor 1, which means that these variables seem to be measuring the same factor.

Table 1 EFA with seven factors

These factors explain 84% of the variance in our dataset, which is good. If the variance explained was less than 60%, we might opt to include additional factors [18]. However, Factor 5 only has two metrics loading on it. The minimum is three, so either we need more metrics or fewer factors.

In this case, we can reduce the number of factors to six, as theorized. Table 2 shows the six-factor solution. This solution explains 78% of the variance and each factor has at least three metrics, so we can move on to iteratively refining and interpreting the factors.

Table 2 EFA with six factors (step 1)

(We included this step to illustrate the realistic complexity of choosing an appropriate number of factors. Sometimes you’re well into the analysis before you figure out how many factors you should have.)

1.5 Interpret and Refine Factors

Rotating the factors helps us interpret them. In a rotated factor solution, the axes are rotated so that variables that load together are plotted closer together, causing them to load highly on a single factor.

Factors can be rotated using orthogonal or oblique rotation. An oblique rotation is preferable when factors are assumed to be correlated; an orthogonal rotation is used otherwise [11, 14, 18]. Despite the popularity of orthogonal rotation (specifically varimax), oblique rotation is usually more appropriate because factors are usually correlated. Use orthogonal rotation only if you have a very good reason to believe that the factors are uncorrelated. We use oblimin rotation (a type of oblique rotation) to rotate the axes in our model.
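A sketch of obtaining the rotated six-factor solution with psych (again using the assumed efa_data name; the factor extraction method is left at the package default):

        library(psych)
        library(GPArotation)   # required for oblimin rotation

        efa6 <- fa(efa_data, nfactors = 6, rotate = "oblimin")

        # Suppress small loadings (< 0.3) when inspecting the solution
        print(efa6$loadings, cutoff = 0.3)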

Now we inspect the solution for problems, and remove problems one at a time, beginning with the worst. There is no algorithm for this. “Worst” is subjective. We can only give examples of problems and describe their severity. We are looking for three basic kinds of problems:

  1. Low communality: Communality (“h2” in Tables 1 and 2) is the amount of variance in a variable that can be explained by the factor solution. Low communality (h2 \(<\) 0.5) indicates that less than half of the variable’s variance is accounted for, implying that the variable is not closely related to any of the factors and adds unwanted complexity without sufficient explanation [18].

  2. Cross-loadings: Variables with high loadings on multiple factors.

  3. Loading on the wrong factor: Variables loading highly (loadings \(>\) 0.5) on a factor they should not theoretically relate to.

Looking at Table 2, Cohesion.LCOM5 has the lowest communality (\(h2=0.16\)) and loads on the wrong factor (in-coupling), so we remove it first and rerun the EFA (Table 3). Now Size.CountDeclMethodDefault has the lowest communality (h2 = 0.25) and loads on the wrong factor (in-coupling again). So we remove it and run the EFA again (Table 4). The next variable with the lowest communality is Cohesion.YALCOM (h2 = 0.45); however, it loads well on the correct factor, so we’ll retain it for now and move on to cross-loadings. Size.CountInstanceVariable loads higher on the wrong factor (cohesion) than on the correct factor (size). Thus, we remove it. Rerunning the EFA (Table 5), we find that In-Coupling.CBOin also has a cross-loading. However, it loads much higher on the correct factor (in-coupling) than the incorrect factor (out-coupling). Furthermore, the incorrect loading is smaller than the smallest correct loading in the EFA, so we retain In-Coupling.CBOin for now.

Table 3 EFA with six factors (step 2)
Table 4 EFA with six factors (step 3)
Table 5 EFA with six factors (step 4)
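As an illustration of one such refinement step (a sketch, not the supplement’s script), dropping a problematic metric and rerunning the EFA might look like this:

        library(dplyr)

        # Remove the metric with low communality that loads on the wrong factor
        efa_data <- select(efa_data, -Cohesion.LCOM5)

        # Rerun the six-factor EFA and inspect the refined solution
        efa6 <- fa(efa_data, nfactors = 6, rotate = "oblimin")
        print(efa6$loadings, cutoff = 0.3)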

Since there are no more low communalities, cross-loadings, or incorrect loadings, and the solution explains 87% of the total variance in the dataset using six factors, each of which has at least three metrics, our EFA is now complete (Table 6).

Table 6 Final EFA


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Ralph, P., Kuutila, M., Arif, H., Ayoola, B. (2024). Teaching Software Metrology: The Science of Measurement for Software Engineering. In: Mendez, D., Avgeriou, P., Kalinowski, M., Ali, N.B. (eds) Handbook on Teaching Empirical Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-71769-7_5


  • DOI: https://doi.org/10.1007/978-3-031-71769-7_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71768-0

  • Online ISBN: 978-3-031-71769-7

