Towards Indonesian Phrase Extraction: Framework and Corpus

Lin, Xiaotian; Lin, Nankai; Xiao, Lixian; Jiang, Shengyi; Qiu, Xinying

doi:10.1007/978-981-16-9709-8_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1496)

Included in the following conference series:

CCF Conference on Big Data

Abstract

Mining quality phrases is a basic task in natural language processing. Current research focuses mainly on widely used languages and rarely addresses low-resource languages such as Indonesian. To the best of our knowledge, no evaluation dataset is available for the phrase extraction task in Indonesian. Phrase extraction is challenging for Indonesian because language analysis tools and large datasets are lacking. We therefore propose a framework for constructing an Indonesian phrase extraction corpus that uses Wikipedia as a high-quality resource and matches the extracted phrases against our POS-tagged corpus. Our linguistic experts manually classified the extracted POS patterns. With the annotated patterns, we re-extract phrases and construct a corpus of 8379 Indonesian phrases in total. In addition, we experiment with three deep learning models that achieved superior performance for phrase extraction and establish baselines for the Indonesian phrase extraction task.
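The re-extraction step described in the abstract (collecting phrases whose POS-tag sequence matches an expert-approved pattern) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_phrases`, the tag labels (`NN`, `JJ`, `VB`, `PRP`), the pattern set, and the toy sentence are all assumptions chosen for illustration, and the paper's actual tag set and patterns are not given in this abstract.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag names used below are illustrative


def extract_phrases(tagged: List[Token], patterns: Set[Tuple[str, ...]]) -> List[str]:
    """Return every contiguous token span whose POS-tag sequence is in `patterns`."""
    phrases = []
    lengths = sorted({len(p) for p in patterns})  # only try span lengths that can match
    for start in range(len(tagged)):
        for n in lengths:
            span = tagged[start:start + n]
            if len(span) < n:
                break  # ran past the end of the sentence; longer spans cannot fit either
            if tuple(tag for _, tag in span) in patterns:
                phrases.append(" ".join(word for word, _ in span))
    return phrases


# Hypothetical approved patterns and a toy POS-tagged sentence (assumptions, not paper data).
PATTERNS = {("NN", "NN"), ("NN", "JJ")}
SENTENCE = [("saya", "PRP"), ("naik", "VB"), ("kereta", "NN"), ("api", "NN")]

print(extract_phrases(SENTENCE, PATTERNS))  # ['kereta api']
```

Keeping the patterns as a set of tag tuples means the list approved by annotators can grow or shrink without changing the matching code.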

X. Lin and N. Lin are co-first authors. They worked together and contributed equally to the paper.

Notes

1. https://id.wikipedia.org/

Acknowledgements

This work was supported by the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016), the National Natural Science Foundation of China (No. 61572145) and the National Social Science Foundation of China (No. 17CTQ045).

Author information

Authors and Affiliations

School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong, China
Xiaotian Lin, Nankai Lin, Shengyi Jiang & Xinying Qiu
Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou, Guangdong, China
Lixian Xiao, Shengyi Jiang & Xinying Qiu
Asian Languages and Cultures, Guangdong University of Foreign Studies, Guangzhou, Guangdong, China
Lixian Xiao

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Xiangke Liao
Shenzhen University of Technology, Chinese Academy of Sciences, Shenzhen, China
Wei Zhao
University of Science and Technology of China, Hefei, China
Enhong Chen
Sun Yat-sen University, Guangzhou, China
Nong Xiao
Taiyuan University of Technology, Taiyuan, China
Li Wang
Nanjing University, Nanjing, China
Yang Gao
Nanjing University, Nanjing, China
Yinghuan Shi
Sun Yat-sen University, Guangzhou, China
Changdong Wang
Sun Yat-sen University, Guangzhou, China
Dan Huang

Cite this paper

Lin, X., Lin, N., Xiao, L., Jiang, S., Qiu, X. (2022). Towards Indonesian Phrase Extraction: Framework and Corpus. In: Liao, X., et al. Big Data. BigData 2021. Communications in Computer and Information Science, vol 1496. Springer, Singapore. https://doi.org/10.1007/978-981-16-9709-8_3

DOI: https://doi.org/10.1007/978-981-16-9709-8_3
Published: 15 January 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9708-1
Online ISBN: 978-981-16-9709-8
eBook Packages: Computer Science, Computer Science (R0)
