Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge

Alfred, Rayner; Anthony, Patricia; Alias, Suraya; Tahir, Asni; Chin, Kim On; Keng, Lau Hui

doi:10.1007/978-3-642-40567-9_24

Part of the book series: Communications in Computer and Information Science (CCIS, volume 378)

Included in the following conference series:

International Multi-Conference on Artificial Intelligence Technology

Abstract

The basic Bag of Words (BOW) representation, which is generally used in text document clustering and categorization, loses important syntactic and semantic information contained in the documents. This may be particularly problematic when the documents contain many stop words or are short. In this paper, we study the contribution of incorporating syntactic features and semantic knowledge into the representation when clustering a text corpus. We assess the quality of the resulting clusters by analyzing their internal structure using the Davies-Bouldin Index (DBI). We compare the quality of the clusters produced when four different text representations are used to cluster the corpus: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and the standard BOW representation integrated with both syntactic features and semantic background knowledge. The experimental results show that the quality of the clusters improves when semantic and syntactic information is integrated into the standard BOW representation of the corpus.
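To make the evaluation idea concrete, the following is a minimal sketch of a plain BOW representation scored with the Davies-Bouldin Index. It is not the authors' implementation: the whitespace tokenizer, the raw term counts, the Euclidean distance, and the helper names `bow_vectors` and `davies_bouldin` are all assumptions made for illustration, and the abstract does not state which weighting scheme or clustering algorithm the paper uses. Cluster labels are supplied by hand here rather than produced by a clustering step.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Build raw term-count vectors over a shared, sorted vocabulary.

    Assumption: lowercase whitespace tokenization, no stemming or stop-word removal.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def davies_bouldin(vectors, labels):
    """Davies-Bouldin Index for a given cluster assignment (lower is better).

    For each cluster, take the worst ratio of summed within-cluster scatter
    to between-centroid distance, then average over clusters. Requires at
    least two clusters with distinct centroids.
    """
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    ids = sorted(clusters)
    cents = {k: centroid(clusters[k]) for k in ids}
    scatter = {
        k: sum(euclidean(p, cents[k]) for p in clusters[k]) / len(clusters[k])
        for k in ids
    }
    total = 0.0
    for i in ids:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in ids
            if j != i
        )
    return total / len(ids)


if __name__ == "__main__":
    # Toy corpus (hypothetical): two topics, two documents each.
    docs = [
        "cat dog pet",
        "cat dog animal",
        "stock market price",
        "stock market trade",
    ]
    _, vectors = bow_vectors(docs)
    print("topic-aligned labels:", davies_bouldin(vectors, [0, 0, 1, 1]))
    print("mixed labels:        ", davies_bouldin(vectors, [0, 1, 0, 1]))
```

On this toy corpus the topic-aligned assignment yields a lower index than the mixed one, which is the sense in which a lower DBI indicates more compact, better separated clusters. Enriching the representation, as the paper proposes, would amount to changing what `bow_vectors` emits while keeping the same scoring step.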

Author information

Authors and Affiliations

Center of Excellence in Semantic Agents, School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
Rayner Alfred, Suraya Alias, Asni Tahir, Kim On Chin & Lau Hui Keng
Department of Applied Computing, Faculty of Environment, Society and Design, Lincoln University, Christchurch, New Zealand
Patricia Anthony

Editor information

Editors and Affiliations

Faculty of Information Science & Technology, University Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Shahrul Azman Noah
Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi, Selangor D. E, Malaysia
Azizi Abdullah
Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia
Haslina Arshad, Zulaiha Ali Othman & Zalinda Othman
School of Computer Science, FTSM, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia
Azuraliza Abu Bakar
Pattern Recognition Research Group, CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
Shahnorbanun Sahran
Faculty of Information Science & IT, National University of Malaysia, 43600, Bangi, Selangor, Malaysia
Nazlia Omar

About this paper

Cite this paper

Alfred, R., Anthony, P., Alias, S., Tahir, A., Chin, K.O., Keng, L.H. (2013). Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge. In: Noah, S.A., et al. Soft Computing Applications and Intelligent Systems. M-CAIT 2013. Communications in Computer and Information Science, vol 378. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40567-9_24

DOI: https://doi.org/10.1007/978-3-642-40567-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40566-2
Online ISBN: 978-3-642-40567-9
eBook Packages: Computer Science, Computer Science (R0)
