Searching for Software on the EGEE Infrastructure

Journal of Grid Computing

Abstract

Several large-scale Grid infrastructures are currently in operation around the world, federating an impressive collection of computational resources, a wide variety of application software, and hundreds of user communities. To better serve current and prospective users of Grid infrastructures, it is important to develop advanced software retrieval services that help users locate software components suited to their needs. In this paper, we present the design and implementation of Minersoft, a distributed, multi-threaded harvester for application software located in large-scale Grid infrastructures. Minersoft crawls the sites of a Grid infrastructure, discovers installed software resources, annotates them with keyword-rich metadata, and creates inverted indexes that support full-text software retrieval. We present insights derived from the implementation and deployment of Minersoft on EGEE, one of the largest Grid production services currently in operation. Experimental results show that Minersoft achieves high performance in crawling EGEE sites and discovering software-related files, and high efficiency in supporting software retrieval.
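
To make the notion of an inverted index over keyword metadata concrete, the following is a minimal sketch in Python. It is not Minersoft's implementation: the function names, the file paths, and the metadata strings are all hypothetical and chosen only for illustration, and the matching rule (a file matches when its metadata contains every query keyword) is an assumed simplification of full-text retrieval.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keyword tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each keyword to the set of file paths whose metadata contains it."""
    index = defaultdict(set)
    for path, metadata in documents.items():
        for token in tokenize(metadata):
            index[token].add(path)
    return index


def search(index, query):
    """Return the paths whose metadata contains every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the postings of the first keyword, then intersect the rest.
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical software files and keyword metadata, invented for illustration.
documents = {
    "/opt/sw/blast/bin/blastall": "sequence alignment bioinformatics blast executable",
    "/opt/sw/fftw/lib/libfftw3.so": "fast fourier transform numerical library",
    "/opt/sw/blast/doc/README": "blast documentation sequence search",
}

index = build_inverted_index(documents)
print(sorted(search(index, "blast sequence")))
```

Because each keyword maps directly to the files that mention it, a query only touches the postings of its own keywords rather than scanning every harvested file. A production index would also need ranking, which this sketch leaves out.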

Author information

Corresponding author

Correspondence to George Pallis.

Additional information

This work was supported in part by the European Commission under the 7th Framework Programme through the SEARCHiN project (Marie Curie Action, contract number FP6-042467) and the Enabling Grids for E-sciencE project (contract number INFSO-RI-222667), and makes use of results produced with the EGEE (www.eu-egee.org) Grid infrastructure. The authors would like to thank the EGEE users who provided characteristic queries for evaluating Minersoft.

About this article

Cite this article

Pallis, G., Katsifodimos, A. & Dikaiakos, M.D. Searching for Software on the EGEE Infrastructure. J Grid Computing 8, 281–304 (2010). https://doi.org/10.1007/s10723-010-9155-y

  • DOI: https://doi.org/10.1007/s10723-010-9155-y
