Document Classification Using POS Distribution

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7503)

Abstract

In this investigation, we discuss how to classify Japanese documents very quickly by focusing on Part-Of-Speech (POS) distribution rather than word distribution. The investigation makes two main contributions: a linear regression approach that models POS behavior in Japanese documents very well for classification, and a new, efficient classification method based on the Gaussian probability distribution, called the Gaussian classifier.
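The abstract does not spell out how the Gaussian classifier is defined, so the following is only a minimal sketch of one plausible reading, not the authors' method: each document is reduced to a vector of POS ratios, each class is modeled by an independent (diagonal) Gaussian per POS ratio, and a document is assigned to the class with the highest log-likelihood. The tag set, the class labels, the toy training data, the variance floor `var_floor`, and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical, simplified tag set; a real Japanese tagger emits many more tags.
TAGSET = ("noun", "verb", "particle")

def pos_ratios(tags, tagset=TAGSET):
    """Relative frequency of each POS tag in one tagged document."""
    total = len(tags) or 1
    return [tags.count(t) / total for t in tagset]

def fit(samples, var_floor=1e-4):
    """Estimate a per-class mean and variance for every POS ratio.

    samples: list of (label, feature_vector) pairs.
    var_floor is an assumed lower bound that keeps variances non-zero.
    """
    by_class = defaultdict(list)
    for label, vec in samples:
        by_class[label].append(vec)
    model = {}
    for label, vecs in by_class.items():
        n, dims = len(vecs), len(vecs[0])
        mean = [sum(v[i] for v in vecs) / n for i in range(dims)]
        var = [
            max(sum((v[i] - mean[i]) ** 2 for v in vecs) / n, var_floor)
            for i in range(dims)
        ]
        model[label] = (mean, var)
    return model

def log_likelihood(vec, mean, var):
    """Log density of vec under independent one-dimensional Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * s) - (x - m) ** 2 / (2 * s)
        for x, m, s in zip(vec, mean, var)
    )

def classify(model, vec):
    """Return the label whose Gaussian gives vec the highest log-likelihood."""
    return max(model, key=lambda label: log_likelihood(vec, *model[label]))

# Toy training data (invented for illustration only).
train = [
    ("news", pos_ratios(["noun"] * 6 + ["verb"] * 1 + ["particle"] * 3)),
    ("news", pos_ratios(["noun"] * 7 + ["verb"] * 1 + ["particle"] * 2)),
    ("novel", pos_ratios(["noun"] * 3 + ["verb"] * 4 + ["particle"] * 3)),
    ("novel", pos_ratios(["noun"] * 2 + ["verb"] * 5 + ["particle"] * 3)),
]
model = fit(train)
query = pos_ratios(["noun"] * 13 + ["verb"] * 2 + ["particle"] * 5)
predicted = classify(model, query)
```

Because the feature vector has one entry per POS tag rather than one per vocabulary word, both fitting and classification touch only a handful of numbers per document, which is consistent with the abstract's emphasis on speed.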

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shirai, M., Miura, T. (2012). Document Classification Using POS Distribution. In: Morzy, T., Härder, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2012. Lecture Notes in Computer Science, vol 7503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33074-2_26

  • DOI: https://doi.org/10.1007/978-3-642-33074-2_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33073-5

  • Online ISBN: 978-3-642-33074-2

  • eBook Packages: Computer Science, Computer Science (R0)
