Abstract
During software evolution, one of the most important comprehension activities is concept location in source code, as it identifies the places in the code where changes are to be made in response to a modification request. Change requests (such as bug fixes or new feature requests) are usually formulated in natural language, while the source code also includes large amounts of text. Consequently, many of the existing concept location techniques are based on text search or text retrieval. Such approaches reformulate concept location as a document retrieval problem. We refine and improve such solutions by leveraging dependencies between source code elements. Dependency information is used by a link analysis algorithm to rank the document space and to improve concept location based on text retrieval. We implemented our solution to concept location using the PageRank algorithm, used in web document retrieval applications. The results of an empirical evaluation indicate that the new approach leads to better retrieval performance than baseline approaches that use text retrieval and clustering. In addition, we present the results of a controlled experiment and of a differentiated replication to assess whether the new technique supports users in identifying the places in the code where changes are to be made. The results of these experiments revealed that the users exploiting our technique were significantly better supported in the identification of the code to be changed in response to a bug-fixing request compared to the users who did not use this technique.
Notes
Applying the Bonferroni correction, the corrected α value is \(\frac{0.05}{9} \approx 0.0055\), where 9 is the number of systems studied in our empirical evaluation. We used this corrected α value when the data from all the software systems were analyzed together.
Although there is no formal standard for the power of a statistical test, the value 0.80 is considered a reasonable threshold for adequacy (Ellis 2010).
Differentiated replications introduce variations in essential aspects of the experimental conditions (Basili et al. 1999). One prominent variation concerns the executions of replications with different kinds of participants and different design. In Shull et al. (2008), this kind of replication is also named independent or conceptual replication.
In Italy, exam grades are expressed as integers ranging from 18 (the lowest grade) to 30 (the highest grade).
They are line graphs in which the means of the dependent variables for each level of one factor are plotted over all the levels of the second factor. If the lines are nearly parallel, then no interaction is present; otherwise, an interaction is present. Intersecting lines are clear evidence of an interaction between factors.
We chose boxplots to show the results, rather than clustered bar charts, for example, because of the different designs used in the experiments and the different numbers of participants in these two experiments. For instance, in USB1 all the participants answered questions Q1 to Q5, while only those who used PR answered questions Q6 to Q9. In USB2 all the participants answered all the questions. The adoption of clustered bar charts could then introduce some distortions when summarizing the post-experiment data for the discussion.
Given two values (a, b), it is computed as \(\frac{a-b}{b} \cdot 100\).
References
Abadi A, Nisenson M, Simionovici Y (2008) A traceability technique for specifications. In: International conference on program comprehension. IEEE CS Press, Washington, DC, pp 103–112
Abrahão S, Gravino C, Pelozo EI, Scanniello G, Tortora G (2013) Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: results from a family of five experiments. IEEE Trans Soft Eng 39 (3):327–342
Ali N, Sabane A, Guéhéneuc Y-G, Antoniol G (2012) Improving bug location using binary class relationships. In: Proceedings of international working conference on source code analysis and manipulation (SCAM). IEEE Computer Society, Washington, DC, pp 174–183
Aranda J, Ernst N, Horkoff J, Easterbrook S (2007) A framework for empirical evaluation of model comprehensibility. In: Proceedings of modeling in software engineering. ICSE Workshop, pp 7–13. IEEE
Arisholm E, Briand LC, Hove SE, Labiche Y (2006) The impact of UML documentation on software maintenance: an experimental evaluation. IEEE Trans Soft Eng 32:365–381
Bajracharya SK, Ngo TC, Linstead E, Dou Y, Rigor P, Baldi P, Lopes CV (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: Tarr PL, Cook WR (eds) Companion to the 21st annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (OOPSLA), Portland, pp 681–682. ACM
Basili V, Caldiera G, Rombach DH (1994) The goal question metric paradigm, encyclopedia of software engineering. Wiley
Basili VR, Shull F, Lanubile F (1999) Building knowledge through families of experiments. In: IEEE Transactions on Software Engineering, IEEE
Beard M, Kraft N, Etzkorn L, Lukins S (2011) Measuring the accuracy of information retrieval based bug localization techniques. In: Proceedings of working conference on reverse engineering (WCRE). IEEE Computer Society, Washington, DC, pp 124–128
Briand LC, Labiche Y, Di Penta M, Yan-Bondoc H (2005) An experimental investigation of formality in UML-based development. IEEE Trans Soft Eng 31 (10):833–849
O'Brien MP, Buckley J (2005) Modelling the information-seeking behaviour of programmers - an empirical approach. In: Proceedings of workshop on program comprehension (IWPC). IEEE Computer Society, pp 125–134
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the seventh international conference on World Wide Web 7, (WWW7). Elsevier, Amsterdam, pp 107–117
Buckner J, Buchta J, Petrenko M, Rajlich V (2005) JRipples: a tool for program comprehension during incremental change. In: Proceedings of international workshop on program comprehension, (IWPC). IEEE Computer Society, pp 149–152
Carver J, Jaccheri L, Morasca S, Shull F (2003) Issues in using students in empirical studies in software engineering education. In: Proceedings of international symposium on software metrics. IEEE Computer Society, Washington, DC, pp 239–250
Chan W-K, Cheng H, Lo D (2012) Searching connected API subgraph via text phrases. In: Proceedings of symposium on the foundations of software engineering. SIGSOFT FSE. ACM, p 10
Chen K, Rajlich V (2000) Case study of feature location using dependence graph. In: Proc. of 8th international workshop on program comprehension, pp 241–247
Ciolkowski M, Muthig D, Rech J (2004) Using academic courses for empirical validation of software development processes. In: Proceedings of EUROMICRO Conference. IEEE Computer Society, Washington, DC, pp 354–361
Cliff N (1993) Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 114 (3):494–509
Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum Associates, Hillsdale
Colosimo M, De Lucia A, Scanniello G, Tortora G (2009) Evaluating legacy system migration technologies through empirical studies. Inf Soft Technol 51 (12):433–447
Conover WJ (1998) Practical Nonparametric Statistics, 3rd edn. Wiley
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41 (6):391–407
Devore JL, Farnum N (1999) Applied statistics for engineers and scientists. Duxbury
De Lucia A, Oliveto R, Tortora G (2009) Assessing IR-based traceability recovery tools through controlled experiments. Empirical Softw Eng 14 (1):57–92
Dit B, Revelle M, Poshyvanyk D (2013a) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empirical Softw Eng 18 (2):277–309. doi:10.1007/s10664-011-9194-4
Dit B, Revelle M, Gethers M, Poshyvanyk D (2013b) Feature location in source code: a taxonomy and survey. Journal of Software: Evolution and Process 25 (1):53–95. doi:10.1002/smr.567
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56:52–64
Eaddy M, Aho AV, Antoniol G, Guéhéneuc Y-G (2008) Cerberus: tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. In: Proceedings of international conference on program comprehension, ICPC ’08. IEEE Computer Society, Washington, DC, pp 53–62
Ellis P (2010) The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press
Gay G, Haiduc S, Marcus A, Menzies T (2009) On the use of relevance feedback in IR-based concept location. In: Proceedings of international conference on software maintenance. IEEE Computer Society, Washington, DC, pp 351–360
Gold N, Harman M, Li Z, Mahdavi K (2006) Allowing overlapping boundaries in source code using a search based approach to concept binding. In: Proceedings of international conference on software maintenance, (ICSM). IEEE Computer Society, Washington, DC, pp 310–319
Grant S, Cordy JR, Skillicorn D (2008) Automated concept location using independent component analysis. In: Proceedings of working conference on reverse engineering, WCRE. IEEE Computer Society, Washington, DC, pp 138–142
Gravino C, Risi M, Scanniello G, Tortora G (2012) Do professional developers benefit from design pattern documentation? A replication in the context of source code comprehension. In: Proceedings of conference on model driven engineering languages and systems, lecture notes in computer science, Springer, pp 185–201
Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: Proceedings of international conference on software engineering, ICSE, vol 1, ACM, New York
Haiduc S, Bavota G, Marcus A, Oliveto R, De Lucia A, Menzies T (2013) Automatic query reformulations for text retrieval in software engineering. In: Proceedings of international conference on software engineering, ICSE. IEEE Press, Piscataway, pp 842–851
Hannay J, Jørgensen M (2008) The role of deliberate artificial design elements in software engineering experiments. IEEE Trans Softw Eng 34 (2):242–259
Harman M, Gold N, Hierons RM, Binkley D (2002) Code extraction algorithms which unify slicing and concept assignment. In: Proceedings of working conference on reverse engineering, WCRE. IEEE Computer Society, Richmond, pp 11–21
Hill E, Pollock L, Vijay-Shanker K (2007) Exploring the neighborhood with Dora to expedite software maintenance. In: Proceedings of international conference on automated software engineering, ASE, ACM, New York
Inoue K, Yokomori R, Yamamoto T, Matsushita M, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31 (3):213–225
Juristo N, Moreno A (2001) Basics of software engineering experimentation. Kluwer Academic Publishers, Englewood Cliffs
Kampenes VB, Dybå T, Hannay JE, Sjøberg DIK (2007) A systematic review of effect size in software engineering experiments. Inf Soft Technol 49 (11–12):1073–1086
Kitchenham B, Al-Khilidar H, Babar M, Berry M, Cox K, Keung J, Kurniawati F, Staples M, Zhang H, Zhu L (2008) Evaluating guidelines for reporting empirical software engineering studies. Empir Soft Eng 13:97–121
Ko AJ, Myers BA, Coblenz MJ, Aung HH (2006) An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Trans Soft Eng 32 (12):971–987
Li Z (2009) Identifying high-level dependence structures using slice-based dependence analysis. In: 25th IEEE international conference on software maintenance (ICSM). Edmonton, pp 457–460. IEEE
Lukins SK, Kraft NA, Etzkorn LH (2008) Source code retrieval for bug localization using latent Dirichlet allocation. In: Proceedings of working conference on reverse engineering, WCRE. IEEE Computer Society, Washington, DC, pp 155–164
Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent Dirichlet allocation. Inf Softw Technol 52 (9):972–990
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Marcus A, Haiduc S (2013) Text retrieval approaches for concept location in source code. In: Software engineering, volume 7171 of lecture notes in computer science. Springer, pp 126–158
Marcus A, Maletic J (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of international conference on software engineering, ICSE. IEEE Computer Society, Portland, pp 124–135
Marcus A, Sergeyev A, Rajlich V, Maletic JI (2004) An information retrieval approach to concept location in source code. In: Proceedings of working conference on reverse engineering, WCRE’ 04. IEEE Computer Society, Washington, DC, pp 214–223
McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usage. In: Proceedings of International Conference on Software Engineering, ICSE, ACM, New York
McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Soft Eng 38 (5):1069–1087
Moreno L, Bandara W, Haiduc S, Marcus A (2013) On the relationship between the vocabulary of bug reports and source code. In: International conference on software maintenance, ICSM, IEEE Computer Society
Ngomo ACN (2009) Low-bias extraction of domain-specific concepts. Ph.D Thesis
Oppenheim AN (1992) Questionnaire design, interviewing and attitude measurement. Pinter, London
Panichella A, McMillan C, Moritz E, Palmieri D, Oliveto R, Poshyvanyk D, De Lucia A (2013) When and how using structural information to improve IR-based traceability recovery. In: European conference on software maintenance and reengineering, CSMR. IEEE Computer Society, Washington, DC, pp 199–208
Petrenko M, Rajlich V (2013) Concept location using program dependencies and information retrieval (DepIR). Inf Softw Technol 55 (4):651–659
Poshyvanyk D, Gethers M, Marcus A (2013) Concept location using formal concept analysis and information retrieval. ACM Trans Softw Eng Methodol 21 (4):23:1–23:34
Poshyvanyk D, Marcus A (2007) Combining formal concept analysis with information retrieval for concept location in source code. In: Proceedings of the 15th IEEE international conference on program comprehension, ICPC. IEEE Computer Society, Washington, DC, pp 37–48
Puppin D, Silvestri F (2006) The social network of Java classes. In: Proceedings of symposium on applied computing, (SAC), ACM, New York
Rajlich V, Wilde N (2002) The role of concepts in program comprehension. In: Proceedings of international workshop on program comprehension, IWP. IEEE Computer Society, Washington, DC, pp 271–278
Revelle M, Dit B, Poshyvanyk D (2010) Using data fusion and web mining to support feature location in software. In: Proceedings of international conference on program comprehension, ICPC. IEEE Computer Society, Washington, DC, pp 14–23
Ricca F, Di Penta M, Torchiano M, Tonella P, Ceccato M (2010) How developers’ experience and ability influence Web application comprehension tasks supported by UML stereotypes: a series of four experiments. IEEE Trans Soft Eng 36 (1):96–118
Robillard MP (2008) Topology analysis of software dependencies. ACM Trans Softw Eng Methodol 17 (4):18:1–18:36
Romano J, Kromrey JD, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys? In: Annual meeting of the Florida association of institutional research
Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw Hill, New York
Scanniello G, D’Amico A, D’Amico C, D’Amico T (2010) Using the Kleinberg algorithm and vector space model for software system clustering. In: International conference on program comprehension, ICPC. IEEE Computer Society, Washington, DC, pp 180–189
Scanniello G, Gravino C, Genero M, Cruz-Lemus JA, Tortora G (2014) On the impact of UML analysis models on source code comprehensibility and modifiability. ACM Trans Softw Eng Methodol 23 (2):13:1–13:26
Scanniello G, Gravino C, Tortora G (2010) Investigating the role of UML in the software modeling and maintenance - a preliminary industrial survey. In: Proceedings of the international conference on enterprise information systems. pp 141–148
Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: Proceedings of international conference on program comprehension, ICPC. IEEE Computer Society, Washington, DC, pp 1–10
Seaman CB (2002) The information gathering strategies of software maintainers. In: Proceedings of the international conference on software maintenance, ICSM. IEEE Computer Society, Washington, DC, pp 141–149
Shapiro S, Wilk M (1965) An analysis of variance test for normality. Biometrika 52 (3–4):591–611
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Soft Eng 13 (2):211–218
Sjoberg DIK, Hannay JE, Hansen O, Kampenes VB, Karahasanovic A, Liborg N, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Soft Eng 31 (9):733–753
Wang J, Peng X, Xing Z, Zhao W (2011) An exploratory study of feature location process: distinct phases, recurring patterns, and elementary actions. In: Proceedings of international conference on software maintenance, ICSM. IEEE Computer Society, pp 213–222
Wang S, Lo D, Jiang L (2011) Code search via topic-enriched dependence graph matching. In: Working conference on reverse engineering, WCRE. IEEE Computer Society, pp 119–123
Wang S, Lo D, Xing Z, Jiang L (2011) Concern localization using information retrieval: an empirical study on Linux kernel. In: Proceedings of working conference on reverse engineering, WCRE. IEEE Computer Society, pp 92–96
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A (2012) Experimentation in software engineering. Springer
Zhao W, Zhang L, Liu Y, Sun J, Yang F (2004) SNIAFL: towards a static non-interactive approach to feature location. In: Proceedings of international conference on software engineering, ICSE. IEEE Computer Society, Washington, DC, pp 293–303
Zhou J, Zhang H, Lo D (2012) Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In: International conference on software engineering, ICSE. IEEE, pp 14–24
Acknowledgments
We would like to thank Michele Brescia, who developed some of the software modules of the prototype used in the experimentation presented here, and Pasquale Ricciardi for helping us in the execution of the replication. We also thank the participants in the controlled experiments. Andrian Marcus was supported in part by grants from the US National Science Foundation: CCF-1017263 and CCF-0845706.
Additional information
Communicated by: Ahmed E. Hassan
Appendix
In this appendix, we summarize CLC (Scanniello and Marcus 2011), namely one of the baseline approaches we have selected for the investigation presented in this paper. The main steps are:
1. Corpus Creation. Each method results in one document in the corpus. All the comments and identifiers of a method are included in a document. Lead comments for the methods (if any) are also included in the corresponding document.
2. Corpus Normalization. The normalization is performed as for PR (see Section 3).
3. Corpus Indexing. A text retrieval engine is used to index the corpus. A numerical index associated with each document in the corpus is created. Later, this index is used to determine similarity measures between documents. We used VSM as the text retrieval engine.
4. Computing Lexical Similarities Between Source Code Documents. This step is necessary to perform the clustering. We use the cosine similarity and compute it between all pairs of documents in the corpus.
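Steps 3 and 4 can be illustrated with a minimal sketch. The text only states that VSM and the cosine similarity are used; the tf-idf weighting, the function names, and the toy corpus below are assumptions made for illustration and are not taken from the CLC implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times a smoothed idf.
        vectors.append({t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Cosine similarity for every unordered pair of documents (i < j)."""
    return {(i, j): cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))}


# Hypothetical corpus: one already-normalized token list per method.
corpus = [
    ["open", "file", "read", "buffer"],
    ["open", "file", "write", "buffer"],
    ["draw", "circle", "radius"],
]
vectors = tfidf_vectors(corpus)
sims = pairwise_similarities(vectors)
```

In this toy corpus the first two methods share most of their vocabulary and therefore obtain a higher similarity than either of them obtains with the third method, which shares no terms with them.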
5. Extracting Dependencies in Software. We represent the software system as a directed graph \(G = (V, E)\). \(V\) is the set of methods in the system, while \(E\) is the set of edges (i.e., ordered pairs of elements of \(V\)). Each edge represents a directed relationship between two methods. We take a conservative approach in this work and only consider direct references between methods. That is, \((m_i, m_j) \in E\) if there is a reference to the method \(m_j\) in the body of the method \(m_i\). These dependencies have been identified by employing JRipples.
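The graph of step 5 can be sketched as follows. The input format (a mapping from each method to the methods referenced in its body) and all names are hypothetical; the sketch does not call JRipples and only shows how \(V\) and \(E\) could be stored once the references are known.

```python
def build_dependency_graph(references):
    """Build (V, E) from a mapping: method -> methods referenced in its body."""
    vertices = set(references)
    for callees in references.values():
        vertices.update(callees)
    # An edge (m_i, m_j) exists if m_j is referenced in the body of m_i.
    edges = {(caller, callee)
             for caller, callees in references.items()
             for callee in callees}
    return vertices, edges


# Hypothetical direct references, as a dependency extractor might report them.
references = {
    "Editor.open": ["Editor.read", "Logger.log"],
    "Editor.read": ["Logger.log"],
    "Logger.log": [],
}
V, E = build_dependency_graph(references)
```

The edges are ordered pairs, so a reference from `Editor.open` to `Editor.read` does not imply an edge in the opposite direction.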
6. Clustering. The graph \(G\) is turned into a directed weighted graph \(G' = (V, E, \omega)\). In particular, the lexical similarity (i.e., cosine similarity) between two methods \(m_i\) and \(m_j\) is used as the weight \(\omega(m_i, m_j)\) of the edge (if present) between the nodes corresponding to these methods. Given how \(G'\) is built, it summarizes both the structural and the lexical information of a subject system.
The BorderFlow clustering algorithm (Ngomo 2009) is applied to \(G'\). The algorithm is a general-purpose graph clustering algorithm. It can be used for soft clustering (i.e., a node of an input graph can be in one or more clusters) and hard clustering (i.e., each node of an input graph can be in exactly one cluster). The hard clustering variant is used in CLC.
The idea behind BorderFlow is to maximize the flow from the border of each cluster to its inner nodes (i.e., the nodes within the cluster) while minimizing the flow from the cluster to the nodes outside of the cluster. Therefore, a cluster \(X\) is a subset of \(V\) that maximizes the border flow ratio:
$$F(X) = \frac{\Omega(b(X), X)}{\Omega(b(X), n(X))}$$
where \(b(X)\) is the set of border nodes of \(X\), while \(n(X)\) is the set of direct neighbors of \(X\). \(\Omega\) assigns to a pair of subsets of \(V\) the total weight of the edges from the first subset to the second one (i.e., the flow between the first and the second subset). This function is computed as follows:
$$\Omega(X, Y) = \sum\limits_{x \in X, y \in Y} \omega(x, y)$$
The algorithm iteratively selects nodes from \(n(X)\) and then inserts them in \(X\) until \(F(X)\) is maximized. The selection of the nodes is performed according to the following two steps:
   1. Computing the set \(C(X)\) that will contain all the nodes \(u \in V \setminus X\) such that \(F(X \cup \{u\}) > F(X)\).
   2. Selecting the candidates \(u \in C(X)\) to get the set \(C_f(X)\). This set contains all the nodes \(u\) that maximize \(\Omega(u, n(X))\).
If \(F(X \cup C_{f}(X)) \geqslant F(X)\), then the nodes of \(C_f(X)\) are added to the set \(X\). The iterative selection of nodes concludes when \(|n(X)|\) equals 0 for each set of nodes \(X\) identified by the BorderFlow algorithm. Each set of nodes forms a cluster.
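The node selection of step 6 can be sketched as follows for a single seed node. This is an illustrative reading of the description above, not the BorderFlow reference implementation. Several points are assumptions: border nodes are taken to be the nodes of \(X\) with at least one positively weighted edge leaving \(X\), direct neighbors are the targets of those edges, \(F(X)\) is treated as infinite when no flow leaves \(X\), and the growth also stops when no candidate improves \(F(X)\). The post-processing that yields the hard clustering is not shown.

```python
import math


def omega(weights, src, dst):
    """Total weight of the edges going from node set src to node set dst."""
    return sum(weights.get((x, y), 0.0) for x in src for y in dst)


def border(weights, cluster):
    """Assumed b(X): nodes of X with at least one edge leaving X."""
    return {x for (x, y), w in weights.items()
            if w > 0 and x in cluster and y not in cluster}


def neighbors(weights, cluster):
    """Assumed n(X): nodes outside X reached by an edge starting in X."""
    return {y for (x, y), w in weights.items()
            if w > 0 and x in cluster and y not in cluster}


def flow_ratio(weights, cluster):
    """Border flow ratio F(X); infinite (by assumption) if nothing leaves X."""
    b = border(weights, cluster)
    outflow = omega(weights, b, neighbors(weights, cluster))
    if outflow == 0:
        return math.inf
    return omega(weights, b, cluster) / outflow


def grow_cluster(weights, seed):
    """Grow one cluster from a seed node with the two selection steps."""
    cluster = {seed}
    while True:
        n_x = neighbors(weights, cluster)
        current = flow_ratio(weights, cluster)
        # Step 1: candidates that strictly improve the border flow ratio.
        candidates = [u for u in n_x
                      if flow_ratio(weights, cluster | {u}) > current]
        if not candidates:
            break
        # Step 2: keep the candidates with maximal flow towards n(X).
        best = max(omega(weights, {u}, n_x) for u in candidates)
        chosen = {u for u in candidates
                  if math.isclose(omega(weights, {u}, n_x), best)}
        if flow_ratio(weights, cluster | chosen) < current:
            break
        cluster |= chosen
    return cluster


# Hypothetical weighted graph: two dense groups joined by one weak link.
weights = {}
for x, y, w in [("a", "b", 0.9), ("a", "c", 0.9), ("b", "c", 0.9),
                ("c", "d", 0.1), ("d", "e", 0.9)]:
    weights[(x, y)] = w
    weights[(y, x)] = w

cluster_a = grow_cluster(weights, "a")
cluster_e = grow_cluster(weights, "e")
```

On this toy graph the cluster grown from `a` stops at the weak link between `c` and `d`, because adding `d` would lower the border flow ratio.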
7. Formulating a Query. Developers formulate textual queries based on the information they have about the change request. Most text retrieval engines do not rely on a predefined vocabulary or grammar; hence the queries do not need to be correct sentences. The query is normalized in the same way as the corpus.
8. Ranking the Documents. In text retrieval-based approaches, documents are retrieved based on their lexical similarity to the query. In CLC, the position of the methods in the ranked list is modified according to the clustering results. Specifically, for all the clusters \(c_k \in C\), where \(C\) is the set of identified clusters, we compute the similarity of each method \(m_i \in c_k\) with the query \(q\) as follows:
$$S(q, m_{i}, c_{k}) = \max\limits_{{m_{j} \in c_{k}}}\{sim(q, m_{j}) | i \neq j \}$$
where \(sim(q, m_j)\) is the lexical similarity (or cosine similarity) between the query \(q\) and method \(m_j\). The methods are then sorted to get a new ranked list. From a practical perspective, CLC no longer retrieves individual methods but instead clusters of related methods (related both structurally and lexically). The retrieval order of the clusters is still based on the lexical similarity to the user’s query.
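The re-ranking of step 8 can be sketched by applying the formula for \(S(q, m_i, c_k)\) literally. Two choices below are assumptions that the text does not specify: a method in a singleton cluster keeps its own similarity, and ties are broken by the method's own similarity and then by name. The method names and similarity values are made up for illustration.

```python
def clc_scores(sim, clusters):
    """S(q, m_i, c_k): best query similarity among the other methods of c_k."""
    scores = {}
    for cluster in clusters:
        for m_i in cluster:
            others = [sim[m_j] for m_j in cluster if m_j != m_i]
            # Assumption: a singleton cluster falls back to sim(q, m_i).
            scores[m_i] = max(others) if others else sim[m_i]
    return scores


def clc_ranking(sim, clusters):
    """Methods sorted by decreasing S; ties broken by own similarity, then name."""
    scores = clc_scores(sim, clusters)
    return sorted(scores, key=lambda m: (-scores[m], -sim[m], m))


# Hypothetical similarities sim(q, m) and clusters.
sim = {"m1": 0.9, "m2": 0.8, "m3": 0.3, "m4": 0.2, "m5": 0.5}
clusters = [{"m1", "m2"}, {"m3", "m4"}, {"m5"}]
ranking = clc_ranking(sim, clusters)
```

Here `m3` and `m4` are each scored by their cluster mate, so they stay adjacent in the list, below the methods of the cluster that contains `m1` and `m2`.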
Differences and Similarities between CLC and the New Approach. Steps 1, 2, and 3 are similar to steps 1, 2, and 3 of the new approach (see Section 3). Also, the steps Extracting Dependencies in Software and Formulating a Query are similar in both approaches. The most relevant differences concern the use of the BorderFlow clustering algorithm, the computation of the lexical similarity among pairs of methods (which is not required in the new approach), and how the ranked list is obtained. Due to these differences, the new approach scales better on larger systems.
Cite this article
Scanniello, G., Marcus, A. & Pascale, D. Link analysis algorithms for static concept location: an empirical assessment. Empir Software Eng 20, 1666–1720 (2015). https://doi.org/10.1007/s10664-014-9327-7
DOI: https://doi.org/10.1007/s10664-014-9327-7