Automated training-set creation for software architecture traceability problem

Empirical Software Engineering

Abstract

Automated trace retrieval methods based on machine-learning algorithms can significantly reduce the cost and effort needed to create and maintain traceability links between requirements, architecture, and source code. However, there is always an upfront cost to train such algorithms to detect the relevant architectural information for each quality attribute in the code. In practice, training supervised or semi-supervised algorithms requires an expert to collect a set of files implementing the architectural tactics that address a quality requirement, and then to train a learning method on them. Establishing such a training set can take weeks to months. Furthermore, the effectiveness of this approach depends largely on the knowledge of the expert. In this paper, we present three baseline approaches for the creation of training data: (i) Manual Expert-Based; (ii) Automated Web-Mining, which generates training sets by automatically mining tactics' APIs from technical programming websites; and (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training-set size on the accuracy of recovering trace links. The results indicate that automated techniques can create a reliable training set for the problem of tracing architectural tactics.
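
To make the role of the training set concrete, the following is a minimal sketch, not the authors' implementation, of how labeled code files could train a text classifier that flags tactic-related code. It assumes a plain multinomial naive Bayes model with add-one smoothing, written with the Python standard library only; the Notes mention that Weka's NaiveBayesMultinomialText was used in the study, and this sketch is only a simplified stand-in for it. The code snippets, the labels `heartbeat` and `unrelated`, and the function names `tokenize`, `train`, and `classify` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split code text into lowercase terms, breaking camelCase identifiers."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(part.lower() for part in parts)
    return tokens


def train(labeled_files):
    """Count documents and terms per label from (source_text, label) pairs."""
    class_docs = Counter()
    term_counts = {}
    vocab = set()
    for text, label in labeled_files:
        class_docs[label] += 1
        counts = term_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return {"class_docs": class_docs, "term_counts": term_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    known = [t for t in tokenize(text) if t in model["vocab"]]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["term_counts"][label]
        total_terms = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in known:
            # Add-one (Laplace) smoothing so unseen terms do not zero the score.
            score += math.log((counts[token] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: two files implementing a heartbeat tactic,
# two files unrelated to it. A real training set would be far larger.
TRAINING = [
    ("class HeartbeatSender { void sendPing() { lastBeat = now(); } }", "heartbeat"),
    ("void checkAlive() { if (now() - lastHeartbeat > timeout) markDead(); }", "heartbeat"),
    ("class InvoicePrinter { void print(Invoice invoice) { render(invoice); } }", "unrelated"),
    ("int add(int left, int right) { return left + right; }", "unrelated"),
]

model = train(TRAINING)
print(classify(model, "void onHeartbeat() { resetTimeout(); sendPing(); }"))
print(classify(model, "int total = add(left, right);"))
```

In this framing, the three approaches compared in the paper differ only in where the labeled examples passed to `train` come from: an expert's manual selection, snippets mined from programming websites, or files mined from large code repositories.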

Notes

  1. http://design.se.rit.edu/budget/

  2. https://msdn.microsoft.com

  3. http://www.oracle.com

  4. http://ghtorrent.org/

  5. http://design.se.rit.edu/budget/

  6. http://www.codeproject.com

  7. http://coest.org/mt/27/150

  8. Weka’s NaiveBayesMultinomialText method was used.

  9. Please see terms in the figures: http://www.1tech.eu/clients/casestudy_ventraq

Author information

Corresponding author

Correspondence to Mehdi Mirakhorli.

Additional information

Communicated by: Patrick Mäder, Rocco Oliveto and Andrian Marcus

About this article

Cite this article

Zogaan, W., Mujhid, I., S. Santos, J.C. et al. Automated training-set creation for software architecture traceability problem. Empir Software Eng 22, 1028–1062 (2017). https://doi.org/10.1007/s10664-016-9476-y

  • DOI: https://doi.org/10.1007/s10664-016-9476-y
