Automated training-set creation for software architecture traceability problem

Zogaan, Waleed; Mujhid, Ibrahim; Santos, Joanna C. S.; Gonzalez, Danielle; Mirakhorli, Mehdi

doi:10.1007/s10664-016-9476-y

Published: 10 November 2016

Volume 22, pages 1028–1062, (2017)

Empirical Software Engineering

Abstract

Automated trace retrieval methods based on machine-learning algorithms can significantly reduce the cost and effort needed to create and maintain traceability links between requirements, architecture and source code. However, there is always an upfront cost to train such algorithms to detect relevant architectural information for each quality attribute in the code. In practice, training supervised or semi-supervised algorithms requires an expert to collect several files that implement the architectural tactics for a quality requirement and then train a learning method on them. Establishing such a training set can take weeks to months. Furthermore, the effectiveness of this approach depends largely on the knowledge of the expert. In this paper, we present three baseline approaches for the creation of training data: (i) Manual Expert-Based; (ii) Automated Web-Mining, which generates training sets by automatically mining tactics’ APIs from technical programming websites; and (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training-set size on the accuracy of recovering trace links. The results indicate that automated techniques can create a reliable training set for the problem of tracing architectural tactics.
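
To make the training-set-size question in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. The Notes mention Weka's NaiveBayesMultinomialText; here a small hand-written multinomial naive Bayes classifier in Python is substituted for it. The code snippets, the labels "tactic" and "other", the heartbeat example, and the helper names (tokenize, MultinomialNB, accuracy_by_size) are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Lowercase alphabetic tokens, with camelCase identifiers split apart."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(p.lower() for p in parts)
    return tokens


class MultinomialNB:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels))
                      for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        def score(c):
            denom = self.totals[c] + len(self.vocab)
            # Tokens never seen during training are skipped.
            return self.prior[c] + sum(
                math.log((self.counts[c][t] + 1) / denom)
                for t in tokenize(doc) if t in self.vocab)
        return max(self.classes, key=score)


def accuracy_by_size(train, test, sizes):
    """Train on the first n labeled files for each n and score on test."""
    results = {}
    for n in sizes:
        docs, labels = zip(*train[:n])
        model = MultinomialNB().fit(list(docs), list(labels))
        hits = sum(model.predict(d) == y for d, y in test)
        results[n] = hits / len(test)
    return results


# Hypothetical labeled files: "tactic" marks invented heartbeat-style code.
TRAIN = [
    ("class HeartbeatSender { void sendPing() { isAlive = true; } }", "tactic"),
    ("class InvoicePrinter { void printTotal() { formatCurrency(); } }", "other"),
    ("void checkHeartbeat() { if (lastPing > timeout) markDead(); }", "tactic"),
    ("String renderMenu() { return buildHtml(menuItems); }", "other"),
]
TEST = [
    ("void pingMonitor() { heartbeat.timeout = 30; }", "tactic"),
    ("void printInvoice() { formatCurrency(total); }", "other"),
]

print(accuracy_by_size(TRAIN, TEST, [1, 2, 4]))
```

With a single training file the classifier has seen only one class, so it cannot separate tactic-related code from unrelated code. This toy data is far too small to say anything about the accuracy figures reported in the paper; it only illustrates the shape of such an experiment.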

Notes

http://design.se.rit.edu/budget/
https://msdn.microsoft.com
http://www.oracle.com
http://ghtorrent.org/
http://design.se.rit.edu/budget/
http://www.codeproject.com
http://coest.org/mt/27/150
Weka’s NaiveBayesMultinomialText method was used.
Please see terms in the figures: http://www.1tech.eu/clients/casestudy_ventraq

Author information

Authors and Affiliations

Software Engineering Department, Rochester Institute of Technology, Rochester, NY, USA
Waleed Zogaan, Ibrahim Mujhid, Joanna C. S. Santos, Danielle Gonzalez & Mehdi Mirakhorli

Corresponding author

Correspondence to Mehdi Mirakhorli.

Additional information

Communicated by: Patrick Mäder, Rocco Oliveto and Andrian Marcus

About this article

Cite this article

Zogaan, W., Mujhid, I., S. Santos, J.C. et al. Automated training-set creation for software architecture traceability problem. Empir Software Eng 22, 1028–1062 (2017). https://doi.org/10.1007/s10664-016-9476-y

Issue Date: June 2017