LiCord: Language Independent Content Word Finder

Rahoman, Md-Mizanur; Nasukawa, Tetsuya; Kanayama, Hiroshi; Ichise, Ryutaro

doi:10.1007/978-3-319-32034-2_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9648)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Content Words (CWs) are important segments of text. In text mining, they are used for purposes such as topic identification, document summarization, and question answering. Identifying CWs usually requires various language-dependent tools. However, such tools are unavailable for many languages, and developing them for every language is costly. At the same time, because of the recent growth of text content in many languages, language-independent text mining has great potential. Mining such text automatically requires a way of finding CWs that does not depend on language tools. In this research, we devise a framework that identifies which text segments are CWs in a language-independent way. We identify structural features that relate text segments to CWs, derive these features from a large text corpus, and apply machine-learning-based classification to decide which segments are CWs. The proposed framework uses only a large text corpus and some training examples; apart from these, it requires no language-specific tool. We conducted experiments with the framework on three languages (English, Vietnamese, and Indonesian) and found that it works with more than 83% accuracy.
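The general shape of such a pipeline (corpus-derived features followed by supervised classification) can be illustrated with a minimal sketch. It is not the framework described in the chapter: the abstract does not name the structural features or the classifier, so the three features used here (log relative frequency and left/right neighbour diversity) and the nearest-centroid classifier are assumptions chosen only for illustration, and the function names `structural_features`, `train_centroids`, and `classify` are hypothetical.

```python
from collections import Counter, defaultdict
import math


def structural_features(sentences):
    """Map each word to (log relative frequency, left diversity, right diversity).

    These features are illustrative assumptions, not the ones used in the paper.
    Tokenization is plain whitespace splitting, so no language tool is needed.
    """
    freq = Counter()
    left = defaultdict(set)    # distinct words seen immediately before each word
    right = defaultdict(set)   # distinct words seen immediately after each word
    total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            freq[tok] += 1
            total += 1
            if i > 0:
                left[tok].add(tokens[i - 1])
            if i + 1 < len(tokens):
                right[tok].add(tokens[i + 1])
    return {
        w: (math.log(c / total), len(left[w]) / c, len(right[w]) / c)
        for w, c in freq.items()
    }


def train_centroids(feats, labels):
    """Average the feature vectors of the labelled training words per class."""
    sums, counts = {}, Counter()
    for word, label in labels.items():
        vec = feats[word]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] += 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(vec, centroids[y]))


# Toy corpus and training examples (hypothetical data for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat saw the dog",
    "a dog saw a cat on the mat",
]
feats = structural_features(corpus)
centroids = train_centroids(
    feats, {"the": "function", "a": "function", "cat": "content", "dog": "content"}
)
predicted = {w: classify(v, centroids) for w, v in feats.items()}
```

With a corpus this small the predictions carry no weight. The sketch only shows that both steps can run on raw text plus a handful of labelled examples, which is the property the abstract emphasizes.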

Notes

1. In the later part, we use n-gram(s) to mean word n-gram(s).
2. http://www2.fs.u-bunkyo.ac.jp/~gilner/wordlists.html#functionwords
3. https://translate.google.com/
4. http://www.speech.sri.com/projects/srilm/
5. More accurately, a DBpedia annotator. DBpedia works as a structured version of Wikipedia; it can be found at http://dbpedia.org/about/
6. http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier (for the example of GoogleChina) and http://dbpediaspotlight.github.io/demo/, respectively.
7. http://nlp.stanford.edu/software/lex-parser.shtml

Author information

Authors and Affiliations

SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
Md-Mizanur Rahoman & Ryutaro Ichise
IBM Research – Tokyo, Tokyo, Japan
Tetsuya Nasukawa & Hiroshi Kanayama
National Institute of Informatics, Tokyo, Japan
Ryutaro Ichise

Corresponding author

Correspondence to Md-Mizanur Rahoman.

Editor information

Editors and Affiliations

Universidad Pablo de Olavide, Sevilla, Spain
Francisco Martínez-Álvarez
Universidad Pablo de Olavide, Sevilla, Spain
Alicia Troncoso
University of Salamanca, Salamanca, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Rahoman, MM., Nasukawa, T., Kanayama, H., Ichise, R. (2016). LiCord: Language Independent Content Word Finder. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science, vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_4

DOI: https://doi.org/10.1007/978-3-319-32034-2_4
Published: 14 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32033-5
Online ISBN: 978-3-319-32034-2
eBook Packages: Computer Science, Computer Science (R0)
