Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge

  • Conference paper
Soft Computing Applications and Intelligent Systems (M-CAIT 2013)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 378)

Abstract

The basic Bag of Words (BOW) representation, which is generally used in clustering or categorizing text documents, loses important syntactic and semantic information contained in the documents. This may be particularly problematic when the documents contain many stop words or are short. In this paper, we study the contribution of incorporating syntactic features and semantic knowledge into the representation when clustering a text corpus. We assess the quality of the resulting clusters by analyzing their internal structure using the Davies-Bouldin index (DBI). We compare the clusters produced by four text representations: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and the standard BOW representation integrated with both syntactic features and semantic background knowledge. The experimental results show that integrating semantic and syntactic information into the standard BOW representation improves the quality of the clusters produced.
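
To make the evaluation idea concrete, the following is a minimal sketch of how a Davies-Bouldin index can be computed over BOW vectors. It is not the authors' implementation: the abstract does not state the term weighting, distance measure, or clustering algorithm, so raw term counts, Euclidean distance, and hand-assigned cluster labels are assumptions made here for illustration. The function names (`bow_vectors`, `davies_bouldin`) and the toy documents are hypothetical. A lower index indicates more compact, better separated clusters.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Build raw term-count vectors over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance (an assumption; the abstract names no metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def davies_bouldin(vectors, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: average distance of a cluster's members to its centroid.
    scatter = {
        lab: sum(euclidean(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Raises ZeroDivisionError if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy corpus with hand-assigned cluster labels (illustrative only).
docs = [
    "the cat sat on the mat",
    "a cat and a kitten",
    "stock prices fell sharply",
    "the stock market fell",
]
labels = [0, 0, 1, 1]
vocab, vectors = bow_vectors(docs)
dbi = davies_bouldin(vectors, labels)
print(f"Davies-Bouldin index on plain BOW: {dbi:.3f}")
```

Under these assumptions, comparing representations amounts to rebuilding `vectors` with extra syntactic or semantic features, clustering again, and checking whether the index decreases.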

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Alfred, R., Anthony, P., Alias, S., Tahir, A., Chin, K.O., Keng, L.H. (2013). Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge. In: Noah, S.A., et al. Soft Computing Applications and Intelligent Systems. M-CAIT 2013. Communications in Computer and Information Science, vol 378. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40567-9_24

  • DOI: https://doi.org/10.1007/978-3-642-40567-9_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40566-2

  • Online ISBN: 978-3-642-40567-9

  • eBook Packages: Computer Science, Computer Science (R0)
