Information retrieval versus deep learning approaches for generating traceability links in bilingual projects

Lin, Jinfeng; Liu, Yalin; Cleland-Huang, Jane

doi:10.1007/s10664-021-10050-0

Published: 22 October 2021

Empirical Software Engineering, volume 27, article number 5 (2022)

Jinfeng Lin¹,
Yalin Liu¹ &
Jane Cleland-Huang¹

Abstract

Software traceability links are established between diverse artifacts of the software development process to support tasks such as compliance analysis, safety assurance, and requirements validation. However, practice has shown that creating and maintaining trace links is difficult and costly in non-trivially sized projects. For this reason, many researchers have proposed and evaluated automated approaches based on information retrieval and deep learning. Generating trace links automatically can also be challenging, especially in multi-national projects whose artifacts are written in multiple languages. The intermingled language use can reduce the efficiency of automated tracing solutions. In this work, we analyze patterns of intermingled language that we observed in several different projects, and then comparatively evaluate different tracing algorithms. These include Information Retrieval (IR) techniques, such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and various models that combine mono- and cross-lingual word embeddings with the Generalized Vector Space Model (GVSM), as well as a deep learning approach based on a BERT language model. Our experimental analysis of trace links generated for 14 Chinese-English projects indicates that our MultiLingual Trace-BERT approach performed best in large projects, with close to twice the accuracy of the best IR approach, while the IR-based GVSM with neural machine translation and a monolingual word embedding performed best on small projects.
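
To make the IR baseline named in the abstract more concrete, the following is a minimal sketch of VSM-style trace link ranking: artifacts become TF-IDF vectors and candidate links are ordered by cosine similarity. It is an illustration only and not the authors' implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, and the sample artifacts are all assumptions made for this example. It is monolingual; for bilingual artifacts, the abstract describes translating the text or using cross-lingual embeddings, neither of which is shown here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_links(source, targets):
    """Rank target artifacts by similarity to the source artifact.

    Returns a list of (target_index, score) pairs, highest score first.
    """
    vectors = tfidf_vectors([source] + list(targets))
    source_vec, target_vecs = vectors[0], vectors[1:]
    scores = [(i, cosine(source_vec, vec)) for i, vec in enumerate(target_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical issue text and commit messages, invented for illustration.
    issue = "user login with password"
    commits = [
        "fix password check in login handler",
        "update readme build instructions",
        "add login page",
    ]
    for index, score in rank_links(issue, commits):
        print(f"{score:.3f}  {commits[index]}")
```

Because this kind of VSM matches only on shared terms, an issue written in Chinese and a commit message written in English would score zero here even when they describe the same change, which is the gap that the translation-based and embedding-based variants in the abstract aim to close.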

Notes

Our dataset can be found at https://doi.org/10.5281/zenodo.3713256
Repository for MT-BERT: https://github.com/jinfenglin/EMSE2020

Acknowledgements

The work described in this paper has been partially funded by United States National Science Foundation grants CCF-1649448 and SHF-1901059.

Author information

Authors and Affiliations

University of Notre Dame, Notre Dame, IN, USA
Jinfeng Lin, Yalin Liu & Jane Cleland-Huang

Corresponding author

Correspondence to Jinfeng Lin.

Additional information

Communicated by: Georgios Gousios and Sarah Nadi

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

About this article

Cite this article

Lin, J., Liu, Y. & Cleland-Huang, J. Information retrieval versus deep learning approaches for generating traceability links in bilingual projects. Empir Software Eng 27, 5 (2022). https://doi.org/10.1007/s10664-021-10050-0

Accepted: 20 September 2021
Published: 22 October 2021
DOI: https://doi.org/10.1007/s10664-021-10050-0
