Distance Weighted Cosine Similarity Measure for Text Classification

Li, Baoli; Han, Liping

doi:10.1007/978-3-642-41278-3_74

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8206)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

In the Vector Space Model, cosine similarity is widely used to measure the similarity between two vectors. It is very efficient to compute, especially for sparse vectors, because only the non-zero dimensions need to be considered. As a fundamental component, cosine similarity has been applied to a range of text mining problems, including text classification, text summarization, information retrieval, and question answering. Despite its popularity, cosine similarity has shortcomings. Starting with a few synthetic samples, we demonstrate two of them: it is overly biased by features with high values, and it pays little attention to how many features two vectors share. We therefore propose a distance weighted cosine similarity metric. Extensive experiments on text classification demonstrate the effectiveness of the proposed metric.
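
The bias described in the abstract can be illustrated with a minimal sketch. The vectors below are hypothetical and are not the synthetic samples used in the paper; the function names `sparse_cosine` and `shared_features` are likewise assumptions made for illustration. The sketch covers only the standard cosine similarity, since the abstract does not define the proposed distance weighted metric.

```python
import math


def sparse_cosine(u, v):
    """Standard cosine similarity for sparse vectors stored as {feature: weight} dicts."""
    # Iterate over the smaller vector; only features present in both contribute to the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def shared_features(u, v):
    """Number of features that are non-zero in both vectors."""
    return len(u.keys() & v.keys())


# Hypothetical vectors: one dominant feature "a" plus three low-valued features.
query = {"a": 10.0, "b": 1.0, "c": 1.0, "d": 1.0}
one_shared = {"a": 10.0}                        # shares only the dominant feature
three_shared = {"b": 1.0, "c": 1.0, "d": 1.0}   # shares three low-valued features

for name, doc in [("one_shared", one_shared), ("three_shared", three_shared)]:
    print(name, shared_features(query, doc), round(sparse_cosine(query, doc), 3))
```

In this toy case the vector that shares a single high-valued feature with `query` receives a much higher cosine score than the vector that shares three low-valued features, which is the behaviour the abstract identifies as a problem.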

Author information

Authors and Affiliations

Department of Computer Science, Henan University of Technology, 1 Lotus Street, High & New Industrial Development Zone, Zhengzhou, Henan, 450001, China
Baoli Li & Liping Han

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin
University of Science and Technology of China, Hefei, China
Ke Tang
Nanjing University, Nanjing, China
Yang Gao
Ostfalia University of Applied Sciences, 38302, Wolfenbüttel, Germany
Frank Klawonn
Kyungpook National University, 702-701, Buk-Gu, Daegu, Korea
Minho Lee
Nature Inspired Computational and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
Thomas Weise
University of Science and Technology of China, 230017, Hefei, China
Bin Li
CERCIA, School of Computer Science, University of Birmingham, B15 2TT, Edgbaston, Birmingham, UK
Xin Yao

About this paper

Cite this paper

Li, B., Han, L. (2013). Distance Weighted Cosine Similarity Measure for Text Classification. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2013. IDEAL 2013. Lecture Notes in Computer Science, vol 8206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41278-3_74

DOI: https://doi.org/10.1007/978-3-642-41278-3_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41277-6
Online ISBN: 978-3-642-41278-3
eBook Packages: Computer Science, Computer Science (R0)
