Abstract
In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object-oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, so their contributions should be weighted differently according to the software system under study. To this end, we define a probabilistic model of the lexeme distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The zone weights are then used to compute similarities among source code classes, which are grouped by a k-Medoid clustering algorithm. To assess the validity of our solution for software architecture recovery, we applied our approach to 19 software systems from different application domains. We observed that using our probabilistic model and the defined zones improves the quality of the clustering results, bringing them close to a theoretical upper bound that we have proved.
Notes
The likelihood is equal to the probability of the observed data \(x\) given the values of the model parameters \(\theta\), i.e., \(\mathcal{L}(\theta | x) = P(x | \theta)\).
A term is a normalized type that is included in the dictionary. A type is the class of all the tokens containing the same character sequence (Manning et al. 2008).
The complete formulation is reported in Saw et al. (1984).
Given two values a and b, the mean percentage improvement is computed as \(\frac{b-a}{a}\).
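As a minimal sketch of the formula in this note, the ratio can be computed as follows. The function name and the sample values are hypothetical and chosen only for illustration; they are not results from the study.

```python
def percentage_improvement(a: float, b: float) -> float:
    """Relative improvement of b over the baseline a, i.e. (b - a) / a."""
    if a == 0:
        # The ratio is undefined for a zero baseline.
        raise ValueError("baseline a must be non-zero")
    return (b - a) / a


# Hypothetical values, used only to show the call.
baseline, improved = 0.50, 0.60
print(f"{percentage_improvement(baseline, improved):.0%}")
```

A negative result indicates that b is worse than the baseline a.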
Each zone has its own vocabulary, so if the same lexeme appears in two or more zones, we treat its occurrences in each zone as distinct terms.
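One way to picture zone-specific vocabularies is to key every term by the pair (zone, lexeme). The sketch below assumes that representation; the function name `zone_terms` and the sample class content are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter


def zone_terms(lexemes_by_zone):
    """Count terms, keying each lexeme by the zone in which it occurs."""
    counts = Counter()
    for zone, lexemes in lexemes_by_zone.items():
        for lexeme in lexemes:
            # The same lexeme in two zones yields two distinct terms.
            counts[(zone, lexeme)] += 1
    return counts


# Hypothetical content of a single class, for illustration only.
example = {
    "Class Name": ["account"],
    "Method Name": ["get", "account"],
    "Comment": ["account", "balance", "account"],
}
terms = zone_terms(example)
for (zone, lexeme), count in sorted(terms.items()):
    print(zone, lexeme, count)
```

Here the lexeme "account" contributes three separate terms, one per zone, each with its own count.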
The Shapiro-Wilk W test returned 0.049 and 0.063 as the p-values for VSM and ZU, respectively. This confirms our postulation on the distribution of data.
The Shapiro-Wilk W test returned 0.063 for ZU and 0.04 for ZWB. On the other hand, this test returned 0.098 for ZWG. Although the results of the Shapiro-Wilk W test suggest that a parametric test (e.g., unpaired t-test) could be used to verify the presence of a statistically significant difference between the values of authoritativeness obtained with ZWG and ZU, we used the Mann-Whitney test because repeated statistical tests were needed to test Hn2.
References
Ali N, Gueheneuc YG, Antoniol G (2011) Requirements traceability for object oriented systems by partitioning source code. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, pp 45–54
Andritsos P, Tzerpos V (2005) Information-theoretic software clustering. IEEE Trans Softw Eng 31(2):150–165
Anquetil N, Fourrier C, Lethbridge TC (1999) Experiments with clustering as a software remodularization method. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 235–255
Basili VR, Green S, Laitenberger O, Lanubile F, Shull F, Sørumgård LS, Zelkowitz MV (1996) The empirical investigation of perspective-based reading. Empir Softw Eng 1(2):133–164
Bavota G, De Lucia A, Marcus A, Oliveto R (2010) Software re-modularization based on structural and semantic metrics. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 195–204
Bavota G, De Lucia A, Marcus A, Oliveto R (2013a) Using structural and semantic measures to improve software modularization. Empir Softw Eng 18(5):901–932
Bavota G, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2013b) An empirical study on the developers’ perception of software coupling. In: Proceedings of international conference on software engineering. IEEE / ACM, pp 692–701
Bavota G, Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2014a) Improving software modularization via automated analysis of latent topics and dependencies. ACM Trans Softw Eng Methodol 23(1):4:1–4:33. doi:10.1145/2559935
Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014b) Methodbook: Recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694
Binkley D (2007) Source code analysis: a road map. In: Future of software engineering. IEEE Computer Society, pp 104–119
Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer
Bittencourt RA, Guerrero DDS (2009) Comparison of graph clustering algorithms for recovering software architecture module views. In: Proceedings of the European conference on software maintenance and reengineering. IEEE Computer Society, pp 251–254
Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley
Corazza A, Di Martino S, Maggio V, Moschitti A, Passerini A, Scanniello G, Silvestri F (2013) Using machine learning and information retrieval techniques to improve software maintainability. In: Eternal systems, communications in computer and information science. Springer, Berlin. In Press
Corazza A, Di Martino S, Maggio V, Scanniello G (2011) Investigating the use of lexical information for software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 35–44
Corazza A, Di Martino S, Scanniello G (2010) A probabilistic based approach towards software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 89–98
De Lucia A, Di Penta M, Oliveto R (2011) Improving source code lexicon via traceability and information retrieval. IEEE Trans Softw Eng 37(2):205–227
De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2012) Using IR methods for labeling source code artifacts: is it worthwhile? In: Proceedings of international conference on program comprehension. IEEE Computer Society Press, pp 193–202
De Lucia A, Risi M, Scanniello G, Tortora G (2009) An investigation of clustering algorithms in the comprehension of legacy web applications. J Web Eng 8(4):346–370
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39(1):1–38
van Deursen A, Hofmeister C, Koschke R, Moonen L, Riva C (2004) Symphony: view-driven software architecture reconstruction. In: Proceedings of working conference on software architecture, pp 122–134
Ducasse S, Pollet D (2009) Software architecture reconstruction: a process-oriented taxonomy. IEEE Trans Softw Eng 35(4):573–591. doi:10.1109/TSE.2009.19
Eastwood A (1993) Firm fires shots at legacy systems. Comput Canada 19(2):17
Erlikh L (2000) Leveraging legacy system dollars for e-business. IT Professional 2:17–23
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press
Freund RJ, Wilson WJ (2003) Statistical methods, 2nd edn. Academic Press
Grubb P, Takang AA (2003) Software maintenance: concepts and practice, 2nd edn. World Scientific
Jarzabek S (2007) Effective software maintenance and evolution—a reuse-based approach. Auerbach Publ
Kampenes V, Dyba T, Hannay J, Sjoberg I (2006) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley Interscience
Kevin Freedman JB (1999) Current concepts review - sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg 81:1454–60
Kitchenham B, Al-Khilidar H, Babar M, Berry M, Cox K, Keung J, Kurniawati F, Staples M, Zhang H, Zhu L (2008) Evaluating guidelines for reporting empirical software engineering studies. Empir Softw Eng 13(1):97–121
Koschke R (2000) Atomic architectural component recovery for program understanding and evolution. Ph.D. thesis, University of Stuttgart
Kuhn A, Ducasse S, Girba T (2005) Enriching reverse engineering with semantic clustering. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 133–142
Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: Identifying topics in source code. Inf Softw Technol 49(3):230–243
Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 233–242
Mahdavi K (2005) A clustering genetic algorithm for software modularisation with a multiple hill climbing approach. Ph.D. thesis, Department of Information Systems and Computing, Brunel University
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of international conference on software engineering. IEEE Computer Society, Washington, pp 103–112
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Maqbool O, Babri H (2007) Hierarchical clustering for software architecture recovery. IEEE Trans Softw Eng 33(11):759–780
Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: International conference on software maintenance. IEEE Computer Society, pp 133–142
Mashiko Y, Basili V (1997) Using the GQM paradigm to investigate influential factors for software process improvement. J Syst Softw 36(1):17–32
McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of workshop on learning for text categorization. AAAI Press, pp 41–48
McLachlan J, Krishnan T (1996) The EM algorithm and extensions. Wiley Interscience
Mendonça NC, Kramer J (1996) Requirements for an effective architecture recovery framework. In: Joint proceedings of the second international software architecture workshop and international workshop on multiple perspectives in software development. ACM, pp 101–105
Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, Inc., New York
Pfleeger SL, Menezes W (2000) Marketing technology to software practitioners. IEEE Softw 17:27–33
Port O (1998) The software trap – automate or else. Bus Week 9(3051):142–154
Porter MF (1997) An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc., San Francisco, pp 313–316
Poshyvanyk D, Marcus A (2006) The conceptual coupling metrics for object-oriented systems. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 469–478
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C, the art of scientific computing, 2nd edn. Cambridge University Press
Reggio G, Ricca F, Scanniello G, Di Cerbo F, Dodero G (2011) A precise style for business process modelling: Results from two controlled experiments. In: Proceedings of model driven engineering languages and systems, lecture notes in computer science. Springer, pp 138–152
Revelle M, Gethers M, Poshyvanyk D (2011) Using structural and textual information to capture feature coupling in object-oriented software. Empir Softw Eng 16(6):773–811
Risi M, Scanniello G, Tortora G (2012) Using fold-in and fold-out in the architecture recovery of software systems. Formal Asp Comput 24(3):307–330
Romano S, Scanniello G, Risi M, Gravino C (2011) Clustering and lexical information support for the recovery of design pattern in source code. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 500–503
Romesburg H (2004) Cluster analysis for researchers. Lulu Press. http://books.google.it/books?id=ZuIPv7OKm10C
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi:10.1145/361219.361220
Saw JG, Yang MCK, Mo TC (1984) Chebyshev inequality with estimated mean and variance. Am Stat 38(2):130–132
Scanniello G, D’Amico A, D’Amico C, D’Amico T (2010) Using the Kleinberg algorithm and Vector Space Model for software system clustering. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 180–189
Scanniello G, Gravino C, Marcus A, Menzies T (2013) Class level fault prediction using software clustering. In: Proceedings of international conference on automated software engineering. IEEE / ACM, pp 640–645
Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 1–10
Scanniello G, Marcus A, Pascale D (2014) Link analysis algorithms for static concept location: an empirical assessment. Empir Softw Eng 1–55. doi:10.1007/s10664-014-9327-7
Scanniello G, Risi M, Tortora G (2010) Architecture recovery using latent semantic indexing and k-means: an empirical evaluation. In: Proceedings of international conference on software engineering and formal methods. IEEE Computer Society, pp 103–112
Shapiro S, Wilk M (1965) An analysis of variance test for normality. Biometrika 52(3–4):591–611
Shtern M, Tzerpos V (2011) Evaluating software clustering using multiple simulated authoritative decompositions. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 353–361
Tonella P (2001) Concept analysis for module restructuring. IEEE Trans Softw Eng 27(4):351–363. doi:10.1109/32.917524
Tzerpos V, Holt RC (1999) MoJo: A distance metric for software clusterings. In: Proceedings of the working conference of reverse engineering, pp 187–193
Wen Z, Tzerpos V (2004) An effectiveness measure for software clustering algorithms. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 194–203
Wiggerts TA (1997) Using clustering algorithms in legacy systems remodularization. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 33–43
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A (2000) Experimentation in software engineering - an introduction. Kluwer
Wu J, Hassan AE, Holt RC (2005) Comparison of clustering algorithms in the context of software evolution. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 525–535
Additional information
Communicated by: Thomas Zimmermann
Cite this article
Corazza, A., Di Martino, S., Maggio, V. et al. Weighing lexical information for software clustering in the context of architecture recovery. Empir Software Eng 21, 72–103 (2016). https://doi.org/10.1007/s10664-014-9347-3
DOI: https://doi.org/10.1007/s10664-014-9347-3