Weighing lexical information for software clustering in the context of architecture recovery

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio; Scanniello, Giuseppe

doi:10.1007/s10664-014-9347-3

Published: 21 March 2015

Volume 21, pages 72–103, (2016)

Empirical Software Engineering

Anna Corazza¹,
Sergio Di Martino¹,
Valerio Maggio² &
…
Giuseppe Scanniello³

Abstract

In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object-oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, so their contributions should be weighted differently according to the software system under study. To this end, we define a probabilistic model of the lexeme distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The resulting zone weights are then used to compute similarities among source code classes, which are grouped by a k-Medoid clustering algorithm. To assess the validity of our solution for software architecture recovery, we applied our approach to 19 software systems from different application domains. We observed that using our probabilistic model and the defined zones improves the quality of the clustering results, bringing them close to a theoretical upper bound that we have proved.
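
The last two steps of the pipeline described in the abstract (zone-weighted similarity, then k-Medoid clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the zone identifiers, the toy classes, and the hand-picked `WEIGHTS` are hypothetical, cosine similarity is assumed as the per-zone measure, and the weights are fixed here rather than estimated by Expectation-Maximization.

```python
import math
from collections import Counter

# The six zones named in the abstract; the identifiers are illustrative.
ZONES = ["class_name", "attribute_name", "method_name",
         "parameter_name", "comment", "statement"]

# Hypothetical hand-picked weights (they sum to 1). The paper estimates
# zone weights with EM; that step is not reproduced in this sketch.
WEIGHTS = {"class_name": 0.3, "attribute_name": 0.1, "method_name": 0.3,
           "parameter_name": 0.1, "comment": 0.1, "statement": 0.1}

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weighted_similarity(a, b, weights):
    """Weighted sum of per-zone cosine similarities between two classes."""
    return sum(weights[z] * cosine(a.get(z, Counter()), b.get(z, Counter()))
               for z in ZONES)

def k_medoids(dist, k, max_iter=100):
    """Alternate cluster assignment and medoid update on a distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # naive deterministic start: the first k items
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            # A medoid always stays in its own cluster, so no cluster is empty.
            nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # The new medoid of a cluster minimises total distance to its members.
        new_medoids = [min(members, key=lambda c: sum(dist[c][j] for j in members))
                       for members in clusters.values()]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters

# Toy classes: per-zone bags of lexemes (invented for illustration only).
classes = {
    "OrderDao": {"class_name": Counter(order=1, dao=1),
                 "method_name": Counter(save=1, order=1),
                 "comment": Counter(database=1)},
    "CustomerDao": {"class_name": Counter(customer=1, dao=1),
                    "method_name": Counter(save=1, customer=1),
                    "comment": Counter(database=1)},
    "LoginView": {"class_name": Counter(login=1, view=1),
                  "method_name": Counter(render=1),
                  "comment": Counter(screen=1)},
    "OrderView": {"class_name": Counter(order=1, view=1),
                  "method_name": Counter(render=1, order=1),
                  "comment": Counter(screen=1)},
}
names = list(classes)
# Distance = 1 - similarity, with a zero diagonal.
dist = [[0.0 if i == j else 1.0 - weighted_similarity(classes[a], classes[b], WEIGHTS)
         for j, b in enumerate(names)] for i, a in enumerate(names)]
clusters = k_medoids(dist, k=2)
groups = sorted(sorted(names[i] for i in members) for members in clusters.values())
print(groups)
```

With these toy inputs the two data-access classes end up in one cluster and the two view classes in the other. Changing `WEIGHTS` shifts how much each zone contributes to the similarity, which is the effect the approach relies on when the weights are estimated per system.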

Notes

The likelihood is equal to the probability of the observed data $x$ given the values of the model parameters $\theta$, i.e., $\mathcal{L}(\theta \mid x) = P(x \mid \theta)$.
A term is a normalized type that is included in the dictionary. A type is the class of all the tokens containing the same character sequence (Manning et al. 2008).
http://www2.unibas.it/gscanniello/software-clustering/
The complete formulation is reported in Saw et al. (1984).
Given two values a and b, the mean percentage improvement is computed as $\frac{(b-a)}{a}$.
Each zone has its own vocabulary, so if the same lexeme appears in two or more zones, we treat it as a distinct term in each zone where it appears.
The Shapiro-Wilk W test returned 0.049 and 0.063 as the p-values for VSM and ZU, respectively. This confirms our postulation on the distribution of data.
The Shapiro-Wilk W test returned 0.063 for ZU and 0.04 for ZWB. On the other hand, this test returned 0.098 for ZWG. Although the results of the Shapiro-Wilk W test suggest that a parametric test (e.g., unpaired t-test) could be used to verify the presence of a statistically significant difference between the values of authoritativeness obtained with ZWG and ZU, we used the Mann-Whitney test because repeated statistical tests were needed to test Hn2.
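
Two of the notes above, the definition of the likelihood and the percentage-improvement formula, can be made concrete with a short Python sketch. It rests on stated assumptions: each token is drawn from a generic mixture of two fixed, invented per-zone term distributions, and only the mixing weights are estimated by Expectation-Maximization. This is not the probabilistic model of the paper, whose formulation is not given in this excerpt; all names and numbers are hypothetical.

```python
import math

def log_likelihood(weights, tokens, zone_dists):
    """log L(theta | x) = log P(x | theta) for i.i.d. tokens under the mixture."""
    return sum(math.log(sum(w * d.get(t, 0.0) for w, d in zip(weights, zone_dists)))
               for t in tokens)

def em_mixing_weights(tokens, zone_dists, iters=50):
    """Estimate mixing weights by EM, keeping the component distributions fixed."""
    k = len(zone_dists)
    weights = [1.0 / k] * k  # uniform starting point
    history = [log_likelihood(weights, tokens, zone_dists)]
    for _ in range(iters):
        totals = [0.0] * k
        for t in tokens:
            # E-step: posterior probability that component j generated token t.
            joint = [w * d.get(t, 0.0) for w, d in zip(weights, zone_dists)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm
        # M-step: the new weights are the average responsibilities.
        weights = [s / len(tokens) for s in totals]
        history.append(log_likelihood(weights, tokens, zone_dists))
    return weights, history

def percentage_improvement(a, b):
    """Improvement of b over a, computed as (b - a) / a."""
    return (b - a) / a

# Invented term distributions for two zones (illustration only).
name_zone = {"order": 0.6, "customer": 0.3, "the": 0.1}
comment_zone = {"the": 0.7, "order": 0.2, "customer": 0.1}
tokens = ["order", "order", "customer", "the", "order", "customer"]

weights, history = em_mixing_weights(tokens, [name_zone, comment_zone])
print(weights)
print(history[0], history[-1])
print(percentage_improvement(0.5, 0.6))
```

The recorded log-likelihood never decreases from one iteration to the next, which is the standard guarantee of EM, and the estimated weight of the first component grows because most tokens are more probable under it. The last call compares two hypothetical scores, 0.5 and 0.6, and yields an improvement of about 0.2, that is, 20%.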

References

Ali N, Gueheneuc YG, Antoniol G (2011) Requirements traceability for object oriented systems by partitioning source code. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, pp 45–54
Andritsos P, Tzerpos V (2005) Information-theoretic software clustering. IEEE Trans Softw Eng 31(2):150–165
Anquetil N, Fourrier C, Lethbridge TC (1999) Experiments with clustering as a software remodularization method. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 235–255
Basili VR, Green S, Laitenberger O, Lanubile F, Shull F, Sørumgård LS, Zelkowitz MV (1996) The empirical investigation of perspective-based reading. Empir Softw Eng 1(2):133–164
Bavota G, De Lucia A, Marcus A, Oliveto R (2010) Software re-modularization based on structural and semantic metrics. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 195–204
Bavota G, De Lucia A, Marcus A, Oliveto R (2013a) Using structural and semantic measures to improve software modularization. Empir Softw Eng 18(5):901–932
Bavota G, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2013b) An empirical study on the developers’ perception of software coupling. In: Proceedings of international conference on software engineering. IEEE / ACM, pp 692–701
Bavota G, Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2014a) Improving software modularization via automated analysis of latent topics and dependencies. ACM Trans Softw Eng Methodol 23(1): 4:1–4:33. doi:10.1145/2559935
Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014b) Methodbook: Recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694
Binkley D (2007) Source code analysis: a road map. In: Future of software engineering. IEEE Computer Society, pp 104–119
Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer
Bittencourt RA, Guerrero DDS (2009) Comparison of graph clustering algorithms for recovering software architecture module views. In: Proceedings of the European conference on software maintenance and reengineering. IEEE Computer Society, pp 251–254
Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley
Corazza A, Di Martino S, Maggio V, Moschitti A, Passerini A, Scanniello G, Silvestri F (2013) Using machine learning and information retrieval techniques to improve software maintainability. In: Eternal systems, communications in computer and information science. Springer, Berlin. In Press
Corazza A, Di Martino S, Maggio V, Scanniello G (2011) Investigating the use of lexical information for software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 35–44
Corazza A, Di Martino S, Scanniello G (2010) A probabilistic based approach towards software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 89–98
De Lucia A, Di Penta M, Oliveto R (2011) Improving source code lexicon via traceability and information retrieval. IEEE Trans Softw Eng 37(2):205–227
De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2012) Using ir methods for labeling source code artifacts: is it worthwhile? In: Proceedings of international conference on program comprehension. IEEE Computer Society Press, pp 193–202
De Lucia A, Risi M, Scanniello G, Tortora G (2009) An investigation of clustering algorithms in the comprehension of legacy web applications. J Web Eng 8(4):346–370
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39(1):1–38
van Deursen A, Hofmeister C, Koschke R, Moonen L, Riva C (2004) Symphony: view-driven software architecture reconstruction. In: Proceedings of working conference on software architecture, pp 122–134
Ducasse S, Pollet D (2009) Software architecture reconstruction: a process-oriented taxonomy. IEEE Trans Softw Eng 35(4):573–591. doi:10.1109/TSE.2009.19
Eastwood A (1993) Firm fires shots at legacy systems. Comput Canada 19(2):17
Erlikh L (2000) Leveraging legacy system dollars for e-business. IT Professional 2:17–23
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press
Freund RJ, Wilson WJ (2003) Statistical methods, 2nd edn. Academic Press
Grubb P, Takang AA (2003) Software maintenance: concepts and practice, 2nd edn. World Scientific
Jarzabek S (2007) Effective software maintenance and evolution—a reuse-based approach. Auerbach Publ
Kampenes V, Dyba T, Hannay J, Sjoberg I (2006) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086
Kaufman L, Rousseeuw P (1990) Finding groups in data an introduction to cluster analysis. Wiley Interscience
Freedman KB, Bernstein J (1999) Current concepts review - sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg 81:1454–60
Kitchenham B, Al-Khilidar H, Babar M, Berry M, Cox K, Keung J, Kurniawati F, Staples M, Zhang H, Zhu L (2008) Evaluating guidelines for reporting empirical software engineering studies. Empir Softw Eng 13(1):97–121
Koschke R (2000) Atomic architectural component recovery for program understanding and evolution. Ph.D. thesis, University of Stuttgart
Kuhn A, Ducasse S, Girba T (2005) Enriching reverse engineering with semantic clustering. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 133–142
Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: Identifying topics in source code. Inf Softw Technol 49(3):230–243
Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 233–242
Mahdavi K (2005) A clustering genetic algorithm for software modularisation with a multiple hill climbing approach. Ph.D. thesis, Department of Information Systems and Computing, Brunel University
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of international conference on software engineering. IEEE Computer Society, Washington, pp 103–112
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Maqbool O, Babri H (2007) Hierarchical clustering for software architecture recovery. IEEE Trans Softw Eng 33(11):759–780
Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: International conference on software maintenance. IEEE Computer Society, pp 133–142
Mashiko Y, Basili V (1997) Using the GQM paradigm to investigate influential factors for software process improvement. J Syst Softw 36(1):17–32
McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of workshop on learning for text categorization. AAAI Press, pp 41–48
Mclachlan J, Krishnan T (1996) The EM algorithm and extensions. Wiley Inter-science
Mendonça NC, Kramer J (1996) Requirements for an effective architecture recovery framework. In: Joint proceedings of the second international software architecture workshop and international workshop on multiple perspectives in software development. ACM, pp 101–105
Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, Inc., New York
Pfleeger SL, Menezes W (2000) Marketing technology to software practitioners. IEEE Softw 17:27–33
Port O (1998) The software trap – automate or else. Bus Week 9(3051):142–154
Porter MF (1997) An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc., San Francisco, pp 313–316
Poshyvanyk D, Marcus A (2006) The conceptual coupling metrics for object-oriented systems. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 469–478
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C, the art of scientific computing, 2nd edn. Cambridge University Press
Reggio G, Ricca F, Scanniello G, Di Cerbo F, Dodero G (2011) A precise style for business process modelling: Results from two controlled experiments. In: Proceedings of model driven engineering languages and systems, lecture notes in computer science. Springer, pp 138–152
Revelle M, Gethers M, Poshyvanyk D (2011) Using structural and textual information to capture feature coupling in object-oriented software. Empir Softw Eng 16(6):773–811
Risi M, Scanniello G, Tortora G (2012) Using fold-in and fold-out in the architecture recovery of software systems. Formal Asp Comput 24(3):307–330
Romano S, Scanniello G, Risi M, Gravino C (2011) Clustering and lexical information support for the recovery of design pattern in source code. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 500–503
Romesburg H (2004) Cluster analysis for researchers. Lulu Press. http://books.google.it/books?id=ZuIPv7OKm10C
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi:10.1145/361219.361220
Saw JG, Yang MCK, Mo TC (1984) Chebyshev inequality with estimated mean and variance. Am Stat 38(2):130–132
Scanniello G, D’Amico A, D’Amico C, D’Amico T (2010) Using the Kleinberg algorithm and Vector Space Model for software system clustering. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 180–189
Scanniello G, Gravino C, Marcus A, Menzies T (2013) Class level fault prediction using software clustering. In: Proceedings of international conference on automated software engineering. IEEE / ACM, pp 640–645
Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 1–10
Scanniello G, Marcus A, Pascale D (2014) Link analysis algorithms for static concept location: an empirical assessment. Empir Softw Eng 1–55. doi:10.1007/s10664-014-9327-7
Scanniello G, Risi M, Tortora G (2010) Architecture recovery using latent semantic indexing and k-means: an empirical evaluation. In: Proceedings of international conference on software engineering and formal methods. IEEE Computer Society, pp 103–112
Shapiro S, Wilk M (1965) An analysis of variance test for normality. Biometrika 52(3–4):591–611
Shtern M, Tzerpos V (2011) Evaluating software clustering using multiple simulated authoritative decompositions. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 353–361
Tonella P (2001) Concept analysis for module restructuring. IEEE Trans Softw Eng 27(4):351–363. doi:10.1109/32.917524
Tzerpos V, Holt RC (1999) Mojo: A distance metric for software clusterings. In: Proceedings of the working conference of reverse engineering, pp 187–193
Wen Z, Tzerpos V (2004) An effectiveness measure for software clustering algorithms. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 194–203
Wiggerts TA (1997) Using clustering algorithms in legacy systems remodularization. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 33–43
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A (2000) Experimentation in software engineering - an introduction. Kluwer
Wu J, Hassan AE, Holt RC (2005) Comparison of clustering algorithms in the context of software evolution. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 525–535

Author information

Authors and Affiliations

Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Napoli, Italy
Anna Corazza & Sergio Di Martino
Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Fisciano, Salerno, Italy
Valerio Maggio
Dipartimento di Matematica, Informatica e Economia, University of Basilicata, Potenza, Italy
Giuseppe Scanniello

Corresponding author

Correspondence to Valerio Maggio.

Additional information

Communicated by: Thomas Zimmermann

About this article

Cite this article

Corazza, A., Di Martino, S., Maggio, V. et al. Weighing lexical information for software clustering in the context of architecture recovery. Empir Software Eng 21, 72–103 (2016). https://doi.org/10.1007/s10664-014-9347-3

Published: 21 March 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10664-014-9347-3
