Information retrieval versus deep learning approaches for generating traceability links in bilingual projects

Empirical Software Engineering

Abstract

Software traceability links are established between diverse artifacts of the software development process to support tasks such as compliance analysis, safety assurance, and requirements validation. However, practice has shown that creating and maintaining trace links is difficult and costly in non-trivially sized projects. For this reason, many researchers have proposed and evaluated automated approaches based on information retrieval and deep learning. Generating trace links automatically can also be challenging, especially in multi-national projects that include artifacts written in multiple languages. The intermingled language use can reduce the efficiency of automated tracing solutions. In this work, we analyze patterns of intermingled language that we observed in several different projects, and then comparatively evaluate different tracing algorithms. These include Information Retrieval (IR) techniques such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), and Latent Dirichlet Allocation (LDA); various models that combine mono- and cross-lingual word embeddings with the Generalized Vector Space Model (GVSM); and a deep-learning approach based on a BERT language model. Our experimental analysis of trace links generated for 14 Chinese-English projects indicates that our MultiLingual Trace-BERT approach performed best in large projects, with close to twice the accuracy of the best IR approach, while the IR-based GVSM with neural machine translation and a monolingual word embedding performed best on small projects.
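To make the IR baseline named in the abstract concrete, the following is a minimal sketch of VSM-style trace link ranking: each artifact becomes a TF-IDF vector and candidate links are ordered by cosine similarity. It is an illustration only, not the authors' implementation. The artifact IDs, the example texts, the whitespace tokenizer, and the function names are all hypothetical, and a bilingual pipeline would additionally need Chinese word segmentation or a translation step before this stage.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumption: text is already
    segmented; Chinese text would need a word segmenter first)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, one common variant among several.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_trace_links(sources, targets):
    """Score every (source, target) pair and sort by descending similarity."""
    vectors = tfidf_vectors(list(sources.values()) + list(targets.values()))
    src_vecs = dict(zip(sources, vectors[:len(sources)]))
    tgt_vecs = dict(zip(targets, vectors[len(sources):]))
    links = [
        (s_id, t_id, cosine(s_vec, t_vec))
        for s_id, s_vec in src_vecs.items()
        for t_id, t_vec in tgt_vecs.items()
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical artifacts, invented for this example.
issues = {
    "ISSUE-1": "fix login timeout error",
    "ISSUE-2": "add export to csv report",
}
commits = {
    "c1": "handle timeout in login session",
    "c2": "csv export for report page",
}

links = rank_trace_links(issues, commits)
for source_id, target_id, score in links:
    print(f"{source_id} -> {target_id}: {score:.3f}")
```

Because VSM scores depend on exact term overlap, a pair written in two different languages shares no terms and scores zero. That limitation is what motivates the translation-based and embedding-based variants the abstract compares.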

Notes

  1. Our dataset can be found at https://doi.org/10.5281/zenodo.3713256

  2. Repository for MT-BERT: https://github.com/jinfenglin/EMSE2020

Acknowledgements

The work described in this paper has been partially funded by United States National Science Foundation grants CCF-1649448 and SHF-1901059.

Author information

Corresponding author

Correspondence to Jinfeng Lin.

Additional information

Communicated by: Georgios Gousios and Sarah Nadi

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

About this article

Cite this article

Lin, J., Liu, Y. & Cleland-Huang, J. Information retrieval versus deep learning approaches for generating traceability links in bilingual projects. Empir Software Eng 27, 5 (2022). https://doi.org/10.1007/s10664-021-10050-0

  • DOI: https://doi.org/10.1007/s10664-021-10050-0
