Abstract
The basic Bag of Words (BOW) representation, which is generally used in text document clustering and categorization, loses important syntactic and semantic information contained in the documents. This is particularly problematic when the documents contain many stop words or are short. In this paper, we study the contribution of incorporating syntactic features and semantic knowledge into the representation when clustering a text corpus. We assess the quality of the resulting clusters by analyzing their internal structure using the Davies-Bouldin Index (DBI). We compare the clusters produced when four different text representations are used to cluster the corpus: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and the standard BOW representation integrated with both syntactic features and semantic background knowledge. The experimental results show that integrating semantic and syntactic information into the standard BOW representation improves the quality of the clusters produced.
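As an illustration of the evaluation measure named in the abstract, the following is a minimal sketch of the Davies-Bouldin Index for a given cluster assignment. It assumes documents have already been turned into numeric vectors and that Euclidean distance is used; the function name `davies_bouldin` and the toy data are hypothetical and are not taken from the paper, which does not state its implementation here. Lower values indicate more compact, better separated clusters.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin Index of a clustering (lower is better).

    points: sequence of equal-length numeric vectors (e.g. BOW weights).
    labels: cluster label for each point, in the same order.
    Assumes Euclidean distance and distinct cluster centroids.
    """
    # Group the points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: the per-dimension mean of its points.
    centroids = {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in clusters.items()
    }

    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster, take the worst (largest) similarity ratio to any
    # other cluster, then average these worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (hypothetical): two tight, well separated groups of 2-D vectors.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
toy_score = davies_bouldin(toy_points, toy_labels)
```

Under these assumptions, the four representations could be compared by clustering each set of document vectors with the same algorithm and reporting the index for each result, with the lowest value indicating the best internal cluster structure.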
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alfred, R., Anthony, P., Alias, S., Tahir, A., Chin, K.O., Keng, L.H. (2013). Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge. In: Noah, S.A., et al. Soft Computing Applications and Intelligent Systems. M-CAIT 2013. Communications in Computer and Information Science, vol 378. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40567-9_24
DOI: https://doi.org/10.1007/978-3-642-40567-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40566-2
Online ISBN: 978-3-642-40567-9
eBook Packages: Computer Science (R0)