Abstract
Several large-scale Grid infrastructures are currently in operation around the world, federating an impressive collection of computational resources, a wide variety of application software, and hundreds of user communities. To better serve the current and prospective users of Grid infrastructures, it is important to develop advanced software retrieval services that help users locate software components suited to their needs. In this paper, we present the design and implementation of Minersoft, a distributed, multi-threaded harvester for application software located in large-scale Grid infrastructures. Minersoft crawls the sites of a Grid infrastructure, discovers installed software resources, annotates them with keyword-rich metadata, and creates inverted indexes that support full-text software retrieval. We present insights derived from the implementation and deployment of Minersoft on EGEE, one of the largest Grid production services currently in operation. Experimental results show that Minersoft achieves high performance in crawling EGEE sites and discovering software-related files, and high efficiency in supporting software retrieval.
Additional information
This work was supported in part by the European Commission under the 7th Framework Programme through the SEARCHiN project (Marie Curie Action, contract number FP6-042467) and the Enabling Grids for E-sciencE project (contract number INFSO-RI-222667), and makes use of results produced with the EGEE (www.eu-egee.org) Grid infrastructure. The authors would like to thank the EGEE users who provided characteristic queries for evaluating Minersoft.
About this article
Cite this article
Pallis, G., Katsifodimos, A. & Dikaiakos, M.D. Searching for Software on the EGEE Infrastructure. J Grid Computing 8, 281–304 (2010). https://doi.org/10.1007/s10723-010-9155-y
DOI: https://doi.org/10.1007/s10723-010-9155-y