The need for software specific natural language techniques

Binkley, Dave; Lawrie, Dawn; Morrell, Christopher

doi:10.1007/s10664-017-9566-5

Published: 25 November 2017

Volume 23, pages 2398–2425 (2018)
Empirical Software Engineering

Abstract

For over two decades, software engineering (SE) researchers have been importing tools and techniques from information retrieval (IR). Initial results have been quite positive. For example, when applied to problems such as feature location or re-establishing traceability links, IR techniques work well on their own, and often even better in combination with more traditional source code analysis techniques such as static and dynamic analysis. Recently, however, there has been growing awareness among SE researchers that IR tools and techniques are designed to work under assumptions that differ from those that hold for a software system. Thus, it may be beneficial to consider IR-inspired tools and techniques that are specifically designed to work with software. One aim of this work is to provide quantitative empirical evidence in support of this observation. To do so, a new technique is introduced that captures the level of difficulty of an information need, that is, the true, often latent, information that a searcher desires to know. The new technique is used to compare two domains: Natural Language (NL) and SE. Analysis of the data leads to three significant findings. First, the variation in the distribution of difficulty of the SE information needs differs from that of the NL information needs; second, collection age plays a role in the differences between the NL collections; and finally, the retrieval model used has little impact on the results.

Notes

Distributed by the Linguistic Data Consortium
http://www.cs.loyola.edu/~lawrie/papers/esme17-binkley-lawrie/english_stoplist
http://www.cs.loyola.edu/~lawrie/papers/esme17-binkley-lawrie/java_reserved_words

Acknowledgements

This work was supported in part by NSF grant 1626262. We are very thankful to the original SCAM reviewers and especially to the EMSE reviewers whose comments were insightful and led to considerable improvements to the paper.

Author information

Authors and Affiliations

Loyola University Maryland, Baltimore, MD, USA
Dave Binkley, Dawn Lawrie & Christopher Morrell

Corresponding author

Correspondence to Dave Binkley.

Additional information

Communicated by: Michaela Greiler and Gabriele Bavota

About this article

Cite this article

Binkley, D., Lawrie, D. & Morrell, C. The need for software specific natural language techniques. Empir Software Eng 23, 2398–2425 (2018). https://doi.org/10.1007/s10664-017-9566-5

Issue Date: August 2018
DOI: https://doi.org/10.1007/s10664-017-9566-5
