Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification

Poyraz, Mitat; Kilimci, Zeynep Hilal; Ganiz, Murat Can

doi:10.1007/s11390-014-1437-6

Regular Paper
Published: 17 May 2014

Volume 29, pages 376–391, (2014)
Journal of Computer Science and Technology

Abstract

Latent semantic indexing (LSI) is known to take advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture “latent semantics”. These findings inspired a previously introduced Bayesian classification framework, Higher-Order Naive Bayes (HONB), which makes explicit use of these higher-order relations. In this paper, we present a novel semantic smoothing method for the Naive Bayes algorithm named Higher-Order Smoothing (HOS). HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. HOS takes the concept one step further by exploiting the relationships between instances of different classes. As a result, we move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. This approach improves parameter estimation when labeled data is insufficient. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
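
To make the notion of a higher-order path concrete, the following is a minimal sketch and not the HOS estimator itself, whose formulas are not given in this abstract. It assumes a common reading of a second-order path: term t1 co-occurs with a bridge term t2 in one document, and t2 co-occurs with term t3 in a different document. The function name `second_order_paths` and the toy documents are hypothetical and serve only as illustration.

```python
from itertools import permutations


def second_order_paths(docs):
    """Enumerate second-order paths (t1, doc_a, t2, doc_b, t3).

    Assumption (not taken from the abstract): a second-order path links
    t1 and t3 through a bridge term t2 that appears together with t1 in
    document a and together with t3 in a different document b.
    """
    doc_terms = [set(d) for d in docs]
    paths = set()
    # Ordered pairs of distinct documents, so each path has a direction.
    for a, b in permutations(range(len(doc_terms)), 2):
        # A bridge term must occur in both documents.
        for t2 in doc_terms[a] & doc_terms[b]:
            for t1 in doc_terms[a] - {t2}:
                # The three terms on a path must be distinct.
                for t3 in doc_terms[b] - {t2, t1}:
                    paths.add((t1, a, t2, b, t3))
    return paths


if __name__ == "__main__":
    toy_docs = [["a", "b"], ["b", "c"]]
    for path in sorted(second_order_paths(toy_docs)):
        print(path)
```

In the toy input, the terms "a" and "c" never share a document, yet they are connected through the bridge term "b". Relations of this kind are what a higher-order method can draw on when direct co-occurrence counts are sparse. If the two documents carried different class labels, the same path would cross a class boundary, which is the extension the abstract attributes to HOS.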

Author information

Authors and Affiliations

Department of Computer Engineering, Dogus University, Istanbul, 34722, Turkey
Mitat Poyraz, Zeynep Hilal Kilimci & Murat Can Ganiz

Corresponding author

Correspondence to Murat Can Ganiz.

Additional information

This work was supported in part by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 111E239. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TÜBİTAK.

A preliminary version of this paper was published in the Proceedings of ICDM 2012.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1 (PDF 75 kb)

About this article

Cite this article

Poyraz, M., Kilimci, Z.H. & Ganiz, M.C. Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification. J. Comput. Sci. Technol. 29, 376–391 (2014). https://doi.org/10.1007/s11390-014-1437-6

Received: 01 September 2013
Revised: 11 March 2014
Published: 17 May 2014
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11390-014-1437-6
