DOI: 10.1145/1341771.1341783
research-article

Effect of word density on measuring words association

Published: 18 January 2008

Abstract

Mining associated words is not a new problem, but because of its wide range of applications it remains an important issue in Information Retrieval. Existing estimators, such as joint probability and the word association norm, do not consider the density of the words present in each window. In this paper, we incorporate word density and propose an estimator based on it to measure the association between words. Experimental results based on human judgments and on precision collected from search engines show that the precision of the estimators can be improved by incorporating word density. Across all ranges of window size, our estimator outperforms all other estimators. We also observe that all of these estimators, both the existing ones and the proposed one, perform relatively better when the windows contain around five sentences. Finally, using the Spearman rank-order correlation coefficient, we show that our estimator returns a better-quality ranking of the associated terms.
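The abstract names the baseline estimators but does not give their formulas, so the following is only a minimal sketch of how window-based association scores might be computed. It assumes windows are non-overlapping groups of consecutive sentences, that joint probability is the fraction of windows containing both words, and that the word association norm takes the pointwise mutual information form log2(P(a,b) / (P(a) P(b))). The function `density_weighted_joint` is a hypothetical illustration of using per-window word density; it is not the estimator proposed in the paper. All function names are made up for this example.

```python
import math


def split_windows(sentences, size):
    """Group consecutive sentences into non-overlapping windows of `size` sentences.

    Each window is returned as a flat list of lower-cased tokens.
    """
    return [
        [tok for s in sentences[i:i + size] for tok in s.lower().split()]
        for i in range(0, len(sentences), size)
    ]


def joint_probability(windows, a, b):
    """Fraction of windows in which both words occur at least once."""
    if not windows:
        return 0.0
    hits = sum(1 for w in windows if a in w and b in w)
    return hits / len(windows)


def association_norm(windows, a, b):
    """Assumed PMI-style score: log2(P(a, b) / (P(a) * P(b))) over windows."""
    n = len(windows)
    p_ab = joint_probability(windows, a, b)
    if p_ab == 0:
        return float("-inf")  # the words never co-occur
    p_a = sum(1 for w in windows if a in w) / n
    p_b = sum(1 for w in windows if b in w) / n
    return math.log2(p_ab / (p_a * p_b))


def density_weighted_joint(windows, a, b):
    """HYPOTHETICAL density-aware score, not the paper's estimator.

    Density of a word in a window is assumed to be count / window length.
    Each window contributes the smaller of the two densities, so windows
    where both words are frequent count for more than a single mention.
    """
    if not windows:
        return 0.0
    total = 0.0
    for w in windows:
        if w:
            total += min(w.count(a), w.count(b)) / len(w)
    return total / len(windows)


def spearman_rho(ranks_x, ranks_y):
    """Spearman rank-order correlation for two rankings without ties."""
    n = len(ranks_x)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


sentences = ["the cat sat", "the cat ran", "a dog barked", "the dog sat"]
windows = split_windows(sentences, 1)
print(joint_probability(windows, "the", "cat"))
print(association_norm(windows, "the", "cat"))
print(density_weighted_joint(windows, "the", "cat"))
```

Changing the `size` argument of `split_windows` is one way to explore the window-size effect the abstract describes, and `spearman_rho` can compare an estimator's ranking of associated terms against a reference ranking, assuming neither ranking contains ties.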


    Published In

    COMPUTE '08: Proceedings of the 1st Bangalore Annual Compute Conference
    January 2008
    195 pages
    ISBN:9781595939500
    DOI:10.1145/1341771
    Sponsors

    • ACM Bangalore chapter

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. joint probability
    2. word density
    3. words association
    4. words association norm

    Qualifiers

    • Research-article

    Conference

    COMPUTE08
    COMPUTE08: ACM Bangalore Chapter COMPUTE 2008
    January 18 - 20, 2008
    Bangalore, India

    Acceptance Rates

    Overall Acceptance Rate 114 of 622 submissions, 18%
