Effect of word density on measuring words association

Authors:

Sanasam Ranbir Singh,

Hema A. Murthy,

Timothy A. Gonsalves

COMPUTE '08: Proceedings of the 1st Bangalore Annual Compute Conference

Article No.: 11, Pages 1 - 8

https://doi.org/10.1145/1341771.1341783

Published: 18 January 2008

Abstract

Mining associated words is not a new problem, but because of its wide range of applications it remains an important issue in Information Retrieval. Existing estimators, such as joint probability and the word association norm, do not consider the density of the words present in each window. In this paper, we incorporate word density and propose a density-based estimator to measure the association between words. Experimental results based on human judgments and on precision collected from search engines show that incorporating word density can improve the precision of the estimators. Across all ranges of window size, our estimator outperforms the other estimators. We also observe that all of the estimators, both the existing ones and the proposed one, perform relatively better when the windows contain around five sentences. Finally, using the Spearman rank-order correlation coefficient, we show that our estimator returns a better-quality ranking of the associated terms.
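
To make the baseline estimators named in the abstract concrete, the following Python sketch computes window-based joint probability and the word association norm (taken here to mean pointwise mutual information in the sense of Church and Hanks), plus a tie-free Spearman rank-order coefficient for comparing two rankings. It is an illustration under assumptions, not the authors' implementation: the function names, the whitespace tokenization, and the use of non-overlapping windows of consecutive sentences are all hypothetical choices. The proposed density-based estimator is not defined in the abstract, so it is deliberately not sketched.

```python
import math
from collections import Counter
from itertools import combinations


def make_windows(sentences, size):
    """Group consecutive sentences into non-overlapping windows.

    Each window is the set of lowercased, whitespace-split words found in
    `size` consecutive sentences (an assumed, simplified tokenization).
    """
    return [
        {w for s in sentences[i:i + size] for w in s.lower().split()}
        for i in range(0, len(sentences), size)
    ]


def window_counts(windows):
    """Count the windows containing each word and each unordered word pair."""
    word_n, pair_n = Counter(), Counter()
    for win in windows:
        word_n.update(win)
        pair_n.update(frozenset(p) for p in combinations(sorted(win), 2))
    return word_n, pair_n


def joint_probability(x, y, pair_n, n_windows):
    """Fraction of windows in which x and y both occur."""
    return pair_n[frozenset((x, y))] / n_windows


def association_norm(x, y, word_n, pair_n, n_windows):
    """Pointwise mutual information: log2(P(x, y) / (P(x) * P(y))).

    Returns -inf when the two words never share a window.
    """
    joint = pair_n[frozenset((x, y))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n_windows / (word_n[x] * word_n[y]))


def spearman_rho(rank_a, rank_b):
    """Spearman coefficient for two tie-free rankings of the same items.

    Both arguments map item -> rank (1..n); requires n >= 2.
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(rank_a)
    d2 = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Calling `make_windows(sentences, 5)` would correspond to the five-sentence windows mentioned in the abstract. Neither estimator above looks at how many times a word occurs inside a window, only whether it occurs, which is the limitation the abstract attributes to existing estimators.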

Index Terms

Effect of word density on measuring words association
1. Information systems
  1. Information retrieval

Published In

COMPUTE '08: Proceedings of the 1st Bangalore Annual Compute Conference

January 2008

195 pages

ISBN:9781595939500

DOI:10.1145/1341771

Program Chair:
R. K. Shyamasundar
IBM Research Labs

Copyright © 2008 ACM.

Sponsors

ACM Bangalore chapter

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

COMPUTE08

Sponsor:

COMPUTE08: ACM Bangalore Chapter COMPUTE 2008

January 18 - 20, 2008

Bangalore, India

Acceptance Rates

Overall Acceptance Rate 114 of 622 submissions, 18%
