Skip to main content
Log in

Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes

  • Special Issue Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

This paper presents a probabilistic relational modelling (implementation) of the major probabilistic retrieval models. Such a high-level implementation is useful since it supports the ranking of any object, it allows for the reasoning across structured and unstructured data, and it gives the software (knowledge) engineer control over ranking and thus supports customisation. The contributions of this paper include the specification of probabilistic SQL (PSQL) and probabilistic relational algebra (PRA), a new relational operator for probability estimation (the relational Bayes), the probabilistic relational modelling of retrieval models, a comparison of modelling retrieval with traditional SQL versus modelling retrieval with PSQL, and a comparison of the performance of probability estimation with traditional SQL versus PSQL. The main findings are that the PSQL/PRA paradigm allows for the description of advanced retrieval models, is suitable for solving large-scale retrieval tasks, and outperforms traditional SQL in terms of abstraction and performance regarding probability estimation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J., Weikum, G., Widom, J., Zdonik, S.: The lowell database research self assessment (2003)

  2. Agrawal, S., Chaudhuri, S., Das, G.: Dbxplorer: a system for keyword-based search over relational databases. In: ICDE, pp. 5–16 (2002)

  3. Agrawal, S., Chaudhuri, S., Das, G., Gionis, A.: Automated ranking of database query results. In: CIDR (2003)

  4. Azzam, H., Roelleke, T.: Efficient processing of ontological queries. In: 2nd VLDB Workshop on Ontologies-based Techniques for Databases and Information Systems, Seoul (2006)

  5. Amati G. and van Rijsbergen C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4): 357–389

    Article  Google Scholar 

  6. Bosc P., Galibourg M. and Hamon G. (1988). Fuzzy querying with sql: extensions and implementation aspects. Fuzzy Sets Syst. 28(3): 333–349

    Article  MATH  MathSciNet  Google Scholar 

  7. Barbara D., Garcia-Molina H. and Porter D. (1990). A probabilistic relational data model. In: Bancilhon, F., Thanos, C., and Tsichrizis, D. (eds) Advances in Database Technology – EDBT ’90, pp 60–74. Springer, Berlin

    Chapter  Google Scholar 

  8. Barbara D., Garcia-Molina H. and Porter D. (1992). The management of probabilistic data. IEEE Trans. Knowl. Data Eng. 4(5): 487–502

    Article  Google Scholar 

  9. Berger, A., Lafferty, J.: Information retrieval as statistical translation. In: SIGIR (ed.) SIGIR ’99 Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pp. 222–229, ACM, New York (1999)

  10. Bosc, P., Pivert, O.: Fuzzy queries and relational databases. In: Proceedings of the 1994 ACM Symposium on Applied Computing, pp. 170–174 ACM Press, New York (1994)

  11. Chaudhuri, S., Das, G., Hristidis, V., Weikum, G.: Probabilistic ranking of database query results. In: VLDB pp. 888–899 (2004)

  12. Chaudhuri S., Das G., Hristidis V. and Weikum G. (2006). Probabilistic information retrieval approach for ranking of database query results. ACM Trans. Database Syst. 31(3): 1134–1168

    Article  Google Scholar 

  13. Croft W.B. and Harper D.J. (1979). Using probabilistic models of document retrieval without relevance information. J. Doc. 35: 285–295

    Google Scholar 

  14. Cavallo, R., Pittarelli, M.: The theory of probabilistic databases. In: Proceedings of the 13th International Conference on Very Large Databases, pp. 71–81 Morgan Kaufman, Los Altos (1987)

  15. Chaudhuri, S., Ramakrishnan, R., Weikum, G.: Integrating db and ir technologies: What is the sound of one hand clapping? In: CIDR, pp. 1–12 (2005)

  16. Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: VLDB, pp. 864–875 (2004)

  17. Dalvi, N.N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB, pp. 805–816 (2005)

  18. de Vries, A., Roelleke, T.: Relevance information: a loss of entropy but a gain for idf? In: ACM SIGIR, Salvador, Brazil (2005)

  19. Ercegovac, V., DeWitt, D.J., Ramakrishnan, R.: The texture benchmark: Measuring performance of text queries on a relational dbms. In: VLDB, pp. 313–324 (2005)

  20. Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley, Reading (2000)

  21. Frommholz I. and Fuhr N. (2006). Probabilistic, object-oriented logics for annotation-based retrieval in digital libraries. In: Marchionini, G., Nelson, M.L. and Marshall, C.C. (eds) JCDL., pp 55–64. ACM, New York

    Chapter  Google Scholar 

  22. Fagin R., Lotem A. and Naor M. (2003). Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614–656

    Article  MATH  MathSciNet  Google Scholar 

  23. Fuhr, N., Roelleke, T.: A probabilistic NF2 relational algebra for integrated information retrieval and database systems. In: Tanik, M.M., Bastani, F.B., Gibson, D., Fielding, P.J. (eds.) Proceedings of the 2nd World Conference on Integrated Design and Process Technology (IDPT), Society for Design and Process Science (SDPS), Austin, pp. 17–30 (1996)

  24. Fuhr N. and Rölleke T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Info. Syst. 14(1): 32–66

    Article  Google Scholar 

  25. Forst, J.F., Tombros, A., Roelleke, T.: solving the enterprise trec task with probabilistic data models. In: Proceedings of TREC 2006, pp. xx–yy (2006)

  26. Fuhr N. (1990). A probabilistic framework for vague queries and imprecise information in databases. In: McLeod, D., Sacks-Davis, R. and Schek, H. (eds) Proceedings of the 16th International Conference on Very Large Databases., pp 696–707. Morgan Kaufman, Los Altos

    Google Scholar 

  27. Fuhr N. (1995). Probabilistic datalog—a logic for powerful retrieval methods. In: Fox, E.A., Ingwersen, P. and Fidel, R. (eds) Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 282–290. ACM, New York

    Chapter  Google Scholar 

  28. Grabs, T., Böhm, K., Schek, H.-J.: Powerdb-ir - information retrieval on top of a database cluster. In: CIKM, pp. 411–418 (2001)

  29. Grabs T., Böhm K. and Schek H.-J. (2004). Powerdb-ir - scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Syst. 6(4): 465–505

    Article  Google Scholar 

  30. Grossman D.A. and Frieder O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer, Massachusetts

    MATH  Google Scholar 

  31. Grossman D.A. and Frieder O. (2004). Information Retrieval. Algorithms and Heuristics, 2nd edn. vol. 15 of The Information Retrieval Series. Springer, Berlin Heidelberg

    Google Scholar 

  32. Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132

    Article  Google Scholar 

  33. Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient ir-style keyword search over relational databases. In: VLDB, pp. 850–861 (2003)

  34. Hiemstra D. (2000). A probabilistic justification for using tf.idf term weighting in information retrieval. Int. J. Digit. Libr. 3(2): 131–139

    Article  Google Scholar 

  35. Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)

  36. Li, C., Chang, K.C.-C., Ilyas, I.F., Song, S.: Ranksql: Query algebra and optimization for relational top-k queries. In: SIGMOD Conference, pp. 131–142 (2005)

  37. Lee, S.K.: An extended relational database model for uncertain and imprecise information. In: Proceedings of the 18th VLDB Conference, pp. 211–220 Morgan Kaufman, Los Altos (1992)

  38. Lakshmanan L.V.S., Leone N., Ross R. and Subrahmanian V.S. (1997). Probview: a flexible probabilistic database system. ACM Trans. Database Syst. 22(3): 419–469

    Article  Google Scholar 

  39. Lalmas, M., Roelleke, T.: Modelling vague content and structure querying in XML retrieval with a probabilistic object-relational framework. In: Proceedings of the 6th International Conference on Flexible Query Answering Systems (FQAS), LNCS, Lyon, Springer, Berlin (2004)

  40. Lalmas M., Roelleke T. and Fuhr N. (2002). Intelligent hypermedia retrieval. In: Szczepaniak, P.S., Segovia, F., and Zadeh, L.A. (eds) Intelligent Exploration of the Web, pp. Springer, Heidelberg

    Google Scholar 

  41. Lafferty, J., Zhai, C.: Probabilistic Relevance Models Based on Document and Query Generation, chap. 1. Kluwer (2002)

  42. Macleod I.A. (1991). Text retrieval and the relational model. J. Am. Soc. Inf. Sci. 42(3): 155–165

    Article  Google Scholar 

  43. Maron M.E. and Kuhns J.L. (1960). On relevance, probabilistic indexing and information retrieval. J. ACM 7: 216–244

    Article  Google Scholar 

  44. Motro A. (1988). Vague: a user interface to relational databases that permits vague queries. ACM Trans. Off. Inf. Syst. 6(3): 187–214

    Article  Google Scholar 

  45. Motro A. (1990). Accommodating imprecision in database systems: Issues and solutions. Sigmod rec. 19(4): 69

    Article  Google Scholar 

  46. Niemi T. and Järvelin K. (1995). A straightforward NF2 relational interface with applications in information retrieval. Inf. Process. Manage. 31(2): 215–231

    Article  Google Scholar 

  47. Ponte J.M. and Croft W.B. (1998). A language modeling approach to information retrieval. In: Bruce Croft, W., Moffat, A., van Rijsbergen, C.J., Wilkinson, R. and Zobel, J. (eds) Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 275–281. ACM, New York

    Chapter  Google Scholar 

  48. Pfeifer, U., Fuhr, N.: Efficient processing of vague queries using a data stream approach. In: Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 189–198, New York (1995)

  49. Roelleke T., Lübeck R. and Kazai G. (2001). The HySpirit retrieval platform, demonstration. In: Croft, B., Harper, D.J., Kraft, D.H., and Zobel, J. (eds) Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New Orleans

    Google Scholar 

  50. Robertson, S.E.: Term frequency and term value. In: SIGIR, pp. 22–29 (1981)

  51. Robertson S.E. (2004). Understanding inverse document frequency: On theoretical arguments for idf. J. Doc. 60: 503–520

    Article  Google Scholar 

  52. Roelleke T. (1999). POOL: Probabilistic Object-Oriented Logical Representation and Retrieval of Complex Objects. Shaker Verlag, Aachen Dissertation

    Google Scholar 

  53. Roelleke, T.: A frequency-based and a Poisson-based probability of being informative. In: Callan, J., Cormarck, G., Clarke, C., Hawking, D., Smeaton, A. (eds.) Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 227–234 (2003)

  54. Ross, J.G.R., Subrahmanian, V.S.: Probabilistic aggregates. In: 13th International Symposium on Methodologies for Intelligent Systems (ISMIS), Lyon, Foundations of Intelligent Systems. Springer, Heidelberg (2002)

  55. Robertson S.E. and Sparck Jones K. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27: 129–146

    Article  Google Scholar 

  56. Roelleke, T., Wang, J.: A parallel derivation of probabilistic information retrieval models. In: ACM SIGIR (2006)

  57. Robertson S.E., Walker S. and Hancock-Beaulieu M.M. (1995). Large test collection experiments on an operational interactive system: Okapi at TREC. Inf. Process. Manage. 31: 345–360

    Article  Google Scholar 

  58. Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: SIGMOD Conference, p. 963 (2005)

  59. Silberschatz, A., Korth, H.F., Sudarshan, S.: Database Systems Concepts, 4th Edn. McGraw-Hill Higher Education (2002)

  60. Schek, H.-J., Pistor, P.: Data structures for an integrated database management and information retrieval system. In: Proceedings of the 8th International Conference on Very Large Data Bases, pp. 197–207, Morgan Kaufman, Los Altos (1982)

  61. Turtle H. and Croft W.B. (1990). Inference networks for document retrieval. In: Vidick, J.-L. (eds) Proceedings of the 13th International Conference on Research and Development in Information Retrieval., pp 1–24. ACM, New York

    Chapter  Google Scholar 

  62. Theobald, M., Weikum, G., Schenkel, R.: Top-k query evaluation withprobabilistic guarantees. In: VLDB, pp. 648–659 (2004)

  63. van Rijsbergen C.J. (1986). A non-classical logic for information retrieval. Comput. J. 29(6): 481–485

    Article  MATH  Google Scholar 

  64. Wong S.K.M. and Yao Y.Y. (1995). On modeling information retrieval with probabilistic inference. ACM Trans. Inf. Sys. 13(1): 38–68

    Article  Google Scholar 

  65. Yu C.T., Lam K. and Salton G. (1982). Term weighting in information retrieval using the term precision model. J. ACM 29(1): 152–170

    Article  MATH  MathSciNet  Google Scholar 

  66. Zhai, C.X., Lafferty, J.D.: Two-stage language models for information retrieval. In: SIGIR, pp. 49–56 (2002)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Thomas Roelleke.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Roelleke, T., Wu, H., Wang, J. et al. Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes. The VLDB Journal 17, 5–37 (2008). https://doi.org/10.1007/s00778-007-0073-y

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-007-0073-y

Keywords

Navigation