
Context-based literature digital collection search

  • Regular Paper
The VLDB Journal

Abstract

We identify two issues with searching literature digital collections within digital libraries. (a) There are no effective paper-scoring and ranking mechanisms; without them, users are often forced to scan a large and diverse set of publications listed as search results, and may miss the important ones. (b) Topic diffusion is a common problem: publications returned by a keyword-based search query often fall into multiple topic areas, not all of which are of interest to users. This paper proposes a new literature digital collection search paradigm that effectively ranks search outputs while controlling the topic diversity of keyword-based query outputs. Our approach is as follows. First, during pre-querying, publications are assigned to pre-specified ontology-based contexts, and query-independent context scores are attached to papers with respect to their assigned contexts. Then, when a query is posed, relevant contexts are selected, the search is performed within the selected contexts, the context scores of publications are revised into relevancy scores with respect to both the query at hand and the context they are in, and query outputs are ranked within each relevant context. This way, we (1) minimize query output topic diversity, (2) reduce query output size, (3) decrease the time users spend scanning query results, and (4) increase query output ranking accuracy. Using genomics-oriented PubMed publications as the testbed and Gene Ontology terms as contexts, our experiments indicate that the proposed context-based search approach produces search results with up to 50% higher precision and reduces the query output size by up to 70%.
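
The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Context` class, the term-overlap test for context relevance, the `text_match` function, and the linear blend controlled by `weight` are all assumptions made for illustration, since the abstract does not specify how contexts are selected or how context scores are revised into relevancy scores.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """A pre-specified ontology-based context (e.g. one ontology term)."""
    name: str
    terms: set                                   # words describing the context (assumed)
    papers: dict = field(default_factory=dict)   # paper id -> query-independent context score


def text_match(query_terms, paper_terms):
    """Hypothetical text-matching score: fraction of query terms found in the paper."""
    if not query_terms:
        return 0.0
    return len(query_terms & paper_terms) / len(query_terms)


def context_search(query, contexts, paper_terms, weight=0.5):
    """Select relevant contexts, then score and rank papers within each one.

    The blend `weight * context_score + (1 - weight) * match` is an assumed
    stand-in for the paper's revision of context scores into relevancy scores.
    """
    query_terms = set(query.lower().split())
    results = {}
    for ctx in contexts:
        # Assumed relevance test: the context shares at least one term with the query.
        if not (query_terms & ctx.terms):
            continue
        ranked = []
        for paper_id, context_score in ctx.papers.items():
            match = text_match(query_terms, paper_terms[paper_id])
            if match == 0.0:
                continue  # paper does not match the query at all
            relevancy = weight * context_score + (1 - weight) * match
            ranked.append((paper_id, relevancy))
        # Rank within the context: highest relevancy first, ties broken by paper id.
        ranked.sort(key=lambda item: (-item[1], item[0]))
        if ranked:
            results[ctx.name] = ranked
    return results


# Pre-querying phase: papers already assigned to contexts with context scores (toy data).
contexts = [
    Context("DNA repair", {"dna", "repair"}, {"p1": 0.9, "p2": 0.4}),
    Context("apoptosis", {"apoptosis", "cell", "death"}, {"p3": 0.8}),
]
paper_terms = {
    "p1": {"dna", "repair", "mismatch"},
    "p2": {"dna", "damage"},
    "p3": {"apoptosis", "caspase"},
}

# Query phase: only the relevant context is searched, and its papers are ranked.
results = context_search("DNA repair", contexts, paper_terms)
```

In this toy run only the "DNA repair" context is searched, so the paper assigned solely to "apoptosis" never appears in the output. That restriction is how limiting the search to selected contexts can reduce both topic diversity and output size.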

Author information

Corresponding author

Correspondence to Nattakarn Ratprasartporn.

About this article

Cite this article

Ratprasartporn, N., Po, J., Cakmak, A. et al. Context-based literature digital collection search. The VLDB Journal 18, 277–301 (2009). https://doi.org/10.1007/s00778-008-0099-9

  • DOI: https://doi.org/10.1007/s00778-008-0099-9
