Abstract
As the number and diversity of distributed Web databases on the Internet increase exponentially, it becomes difficult for users to know which databases are appropriate to search. Given language models that describe the content of each database, database selection services can help users locate the databases relevant to their information needs. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea is that, for databases categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language models. Two-stage smoothed language models are presented to circumvent the inaccuracy caused by word sparseness. Experimental results demonstrate that this language modeling approach is competitive with current state-of-the-art database selection approaches.
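The abstract gives no formulas, so the following is only a minimal sketch of how ranking databases with a smoothed language model might look. It assumes unigram models, query log-likelihood as the similarity measure, and linear (Jelinek-Mercer style) interpolation in two steps: the category model is first smoothed with a global background model, and the database model is then smoothed with that category model. The function names, the weights `lam` and `mu`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mle(counts):
    """Maximum-likelihood unigram model from raw term counts."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def smoothed_prob(word, db_model, cat_model, bg_model, lam, mu):
    """Assumed two-step interpolation: database -> category -> background."""
    p_cat = (1 - mu) * cat_model.get(word, 0.0) + mu * bg_model.get(word, 0.0)
    return (1 - lam) * db_model.get(word, 0.0) + lam * p_cat


def rank_databases(query, databases, lam=0.5, mu=0.3):
    """Rank databases by query log-likelihood under the smoothed models.

    databases maps a database name to (category, Counter of term counts).
    Returns a list of (name, score) pairs, best first.
    """
    # Pool counts per category and over all databases (the background).
    cat_counts, bg_counts = {}, Counter()
    for category, counts in databases.values():
        cat_counts.setdefault(category, Counter()).update(counts)
        bg_counts.update(counts)
    cat_models = {c: mle(cnt) for c, cnt in cat_counts.items()}
    bg_model = mle(bg_counts)

    scores = []
    for name, (category, counts) in databases.items():
        db_model = mle(counts)
        score = 0.0
        for word in query:
            if word not in bg_model:
                continue  # assumption: ignore terms unseen in every database
            p = smoothed_prob(word, db_model, cat_models[category],
                              bg_model, lam, mu)
            score += math.log(p)
        scores.append((name, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection: two health databases and one sports database.
toy = {
    "med_a": ("health", Counter("heart disease treatment heart surgery".split())),
    "med_b": ("health", Counter("diet nutrition health diet".split())),
    "sports_a": ("sports", Counter("football match score football league".split())),
}
ranking = rank_databases(["heart", "surgery"], toy)
print([name for name, _ in ranking])
```

In this toy setting, a database that never mentions a query term still receives a nonzero probability through its category, which is the kind of sparseness problem that smoothing is meant to address.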
Cite this article
Yang, H., Zhang, M. Two-stage statistical language models for text database selection. Inf Retrieval 9, 5–31 (2006). https://doi.org/10.1007/s10791-005-5719-z
DOI: https://doi.org/10.1007/s10791-005-5719-z