Two-stage statistical language models for text database selection

Yang, Hui; Zhang, Minjie

doi:10.1007/s10791-005-5719-z

Published: January 2006

Information Retrieval, Volume 9, pages 5–31 (2006)

Hui Yang¹ &
Minjie Zhang¹

Abstract

As the number and diversity of distributed Web databases on the Internet increase exponentially, it is difficult for users to know which databases are appropriate to search. Given database language models that describe the content of each database, database selection services can help locate databases relevant to users' information needs. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea is that, for databases categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate that this language modeling approach is competitive with current state-of-the-art database selection approaches.
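To make the ranking idea concrete, the following is a minimal sketch of query-likelihood database ranking with a smoothed unigram language model. It is not the two-stage method of the paper, whose details the abstract does not give: the linear interpolation with a background model, the function names, the weight `lam`, the `floor` probability, and the toy databases are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def smoothed_log_likelihood(query, db_model, background, lam=0.5, floor=1e-9):
    """Return log P(query | database) under linear interpolation smoothing.

    Each term probability mixes the database model with a background model,
    so a query term missing from the database does not zero out the score.
    lam (0 < lam <= 1) and floor are illustrative values, not from the paper.
    """
    score = 0.0
    for term in query:
        p_db = db_model.get(term, 0.0)
        p_bg = background.get(term, floor)
        score += math.log((1 - lam) * p_db + lam * p_bg)
    return score


def rank_databases(query, databases, lam=0.5):
    """Rank database names by smoothed query log-likelihood, best first."""
    # Background model: all databases pooled together (an assumption).
    pooled = [t for tokens in databases.values() for t in tokens]
    background = unigram_model(pooled)
    scores = {
        name: smoothed_log_likelihood(query, unigram_model(tokens), background, lam)
        for name, tokens in databases.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy databases, each represented by a bag of words.
databases = {
    "medicine": "heart disease treatment heart surgery patient".split(),
    "sports": "football match score team player match".split(),
}
ranking = rank_databases(["heart", "surgery"], databases)
```

In this toy setting the database whose vocabulary overlaps the query ranks first, while smoothing keeps the other database's score finite. A hierarchical variant could plausibly use a category-level model as the background, but that is a guess at the design rather than something the abstract states.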

Author information

Authors and Affiliations

School of Information Technology and Computer Science, University of Wollongong, Wollongong, 2500, Australia
Hui Yang & Minjie Zhang

Corresponding author

Correspondence to Hui Yang.

About this article

Cite this article

Yang, H., Zhang, M. Two-stage statistical language models for text database selection. Inf Retrieval 9, 5–31 (2006). https://doi.org/10.1007/s10791-005-5719-z

Received: 11 May 2004
Revised: 15 October 2004
Accepted: 18 October 2004
Issue Date: January 2006
DOI: https://doi.org/10.1007/s10791-005-5719-z
