Abstract
Mining quality phrases is a basic task in natural language processing. Current research focuses mainly on widely used languages and is rarely conducted for low-resource languages such as Indonesian. To the best of our knowledge, no evaluation dataset is available for the phrase extraction task in Indonesian. Phrase extraction is challenging for Indonesian because language analysis tools and large datasets are lacking. We therefore propose a framework for constructing an Indonesian phrase extraction corpus, using Wikipedia as a high-quality resource and matching the extracted phrases against our POS-tagged corpus. Our linguistic experts manually classified the extracted POS patterns. With the annotated patterns, we re-extract phrases and construct a corpus of 8379 Indonesian phrases in total. In addition, we experiment with three deep learning models that have achieved superior performance in phrase extraction and establish baselines for the Indonesian phrase extraction task.
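The re-extraction step described in the abstract, in which expert-approved POS patterns are matched against a POS-tagged corpus, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_phrases`, the tag labels (`NN`, `JJ`, and so on), the example sentence, and the two approved patterns are all assumptions chosen for illustration only.

```python
from typing import List, Set, Tuple

# A sentence is a list of (token, POS tag) pairs. The tag labels used
# here are hypothetical, not the paper's tag set.
TaggedSentence = List[Tuple[str, str]]
Pattern = Tuple[str, ...]


def extract_phrases(sentences: List[TaggedSentence],
                    patterns: Set[Pattern]) -> List[str]:
    """Return each distinct multiword span whose tag sequence is an approved pattern."""
    max_len = max((len(p) for p in patterns), default=0)
    phrases: List[str] = []
    seen: Set[str] = set()
    for sentence in sentences:
        tokens = [tok for tok, _ in sentence]
        tags = [tag for _, tag in sentence]
        for start in range(len(sentence)):
            # Only spans of two or more tokens are treated as phrases.
            for length in range(2, max_len + 1):
                end = start + length
                if end > len(sentence):
                    break
                if tuple(tags[start:end]) in patterns:
                    phrase = " ".join(tokens[start:end]).lower()
                    if phrase not in seen:
                        seen.add(phrase)
                        phrases.append(phrase)
    return phrases


# Hypothetical tagged sentence: "Dia bekerja di rumah sakit besar"
# ("He/she works at a large hospital").
example = [[("Dia", "PRP"), ("bekerja", "VB"), ("di", "IN"),
            ("rumah", "NN"), ("sakit", "NN"), ("besar", "JJ")]]

# Hypothetical patterns standing in for the expert-annotated ones.
approved = {("NN", "NN"), ("NN", "NN", "JJ")}

print(extract_phrases(example, approved))
```

In this sketch the approved patterns act as a whitelist, so the quality of the extracted phrases depends entirely on the manual pattern classification and on the accuracy of the POS tags.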
X. Lin and N. Lin are co-first authors. They worked together and contributed equally to the paper.
Acknowledgements
This work was supported by the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016), the National Natural Science Foundation of China (No. 61572145) and the National Social Science Foundation of China (No. 17CTQ045).
Copyright information
© 2022 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lin, X., Lin, N., Xiao, L., Jiang, S., Qiu, X. (2022). Towards Indonesian Phrase Extraction: Framework and Corpus. In: Liao, X., et al. Big Data. BigData 2021. Communications in Computer and Information Science, vol 1496. Springer, Singapore. https://doi.org/10.1007/978-981-16-9709-8_3
DOI: https://doi.org/10.1007/978-981-16-9709-8_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9708-1
Online ISBN: 978-981-16-9709-8
eBook Packages: Computer Science, Computer Science (R0)