Abstract
For over two decades, software engineering (SE) researchers have been importing tools and techniques from information retrieval (IR). Initial results have been quite positive. For example, when applied to problems such as feature location or re-establishing traceability links, IR techniques work well on their own, and often even better in combination with more traditional source code analysis techniques such as static and dynamic analysis. Recently, however, there has been growing awareness among SE researchers that IR tools and techniques are designed to work under different assumptions than those that hold for a software system. Thus, it may be beneficial to consider IR-inspired tools and techniques that are specifically designed to work with software. One aim of this work is to provide quantitative empirical evidence in support of this observation. To do so, a new technique is introduced that captures the level of difficulty of an information need, that is, the true, often latent, information that a searcher desires to know. The new technique is used to compare two domains: Natural Language (NL) and SE. Analysis of the data leads to three significant findings. First, the variation in the distribution of difficulty of the SE information needs differs from that of the NL information needs; second, collection age plays a role in the differences between the NL collections; and finally, the retrieval model used has little impact on the results.
Notes
Distributed by the Linguistic Data Consortium
Acknowledgements
This work was supported in part by NSF grant 1626262. We are very thankful to the original SCAM reviewers and especially to the EMSE reviewers whose comments were insightful and led to considerable improvements to the paper.
Additional information
Communicated by: Michaela Greiler and Gabriele Bavota
About this article
Cite this article
Binkley, D., Lawrie, D. & Morrell, C. The need for software specific natural language techniques. Empir Software Eng 23, 2398–2425 (2018). https://doi.org/10.1007/s10664-017-9566-5
DOI: https://doi.org/10.1007/s10664-017-9566-5