Artifact Representation Techniques for Large-Scale Software Search Engines

  • Chapter
Finding Source Code on the Web for Remix and Reuse

Abstract

The first generation of software retrieval systems, developed some 25 years ago, used simple bibliographic indexing techniques adapted from library science to support the retrieval of relatively small numbers of in-house software artifacts. While these techniques were sufficient at the time, they do not scale to the vast numbers of software artifacts available today. The second generation of software search engines, representing the state of the practice today, tackles this problem by using full-text search frameworks such as Lucene to support text-based searches on large software collections. However, these engines typically provide no inherent support for sophisticated search use cases that exploit the structure and “meaning” of software artifacts. In this chapter we describe the core techniques used in current text-based code search engines, as well as advanced techniques that support more sophisticated searches exploiting the structure of software. We then survey the challenges and opportunities encountered in developing the next (third) generation of software search engines, based on new, currently emerging data storage platforms.



Acknowledgements

The authors would like to thank Philipp Bostan, Matthias Gutheil, Werner Janjic and Dietmar Stoll from the Software Engineering Group at the University of Mannheim for their contributions to developing the tools described in this chapter.

Author information

Corresponding author

Correspondence to Oliver Hummel.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hummel, O., Atkinson, C., Schumacher, M. (2013). Artifact Representation Techniques for Large-Scale Software Search Engines. In: Sim, S.E., Gallardo-Valencia, R.E. (eds) Finding Source Code on the Web for Remix and Reuse. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6596-6_5

  • DOI: https://doi.org/10.1007/978-1-4614-6596-6_5

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-6595-9

  • Online ISBN: 978-1-4614-6596-6

  • eBook Packages: Computer Science, Computer Science (R0)
