Abstract
The first generation of software retrieval systems, developed some 25 years ago, used simple bibliographic indexing techniques adapted from library science to support the retrieval of relatively small numbers of in-house software artifacts. While these techniques were sufficient at the time, they do not scale to the vast numbers of software artifacts available today. The second generation of software search engines, representing the state of the practice today, tackles this problem by using full-text search frameworks such as Lucene to support text-based searches on large software collections. However, these engines typically provide no inherent support for sophisticated search use cases that exploit the structure and “meaning” of software artifacts. In this chapter we describe the core techniques used in current text-based code search engines, as well as advanced techniques that can be used to support sophisticated forms of search that exploit the structure of software. We then survey the challenges and opportunities encountered in the development of the next (third) generation of software search engines based on new, currently emerging data storage platforms.
Acknowledgements
The authors would like to thank Philipp Bostan, Matthias Gutheil, Werner Janjic and Dietmar Stoll from the Software Engineering Group at the University of Mannheim for their contributions to developing the tools described in this chapter.
Copyright information
© 2013 Springer Science+Business Media New York
About this chapter
Cite this chapter
Hummel, O., Atkinson, C., Schumacher, M. (2013). Artifact Representation Techniques for Large-Scale Software Search Engines. In: Sim, S.E., Gallardo-Valencia, R.E. (eds) Finding Source Code on the Web for Remix and Reuse. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6596-6_5
DOI: https://doi.org/10.1007/978-1-4614-6596-6_5
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6595-9
Online ISBN: 978-1-4614-6596-6
eBook Packages: Computer Science (R0)