Towards Indonesian Phrase Extraction: Framework and Corpus

  • Conference paper
Big Data (BigData 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1496)


Abstract

Mining quality phrases is one of the basic tasks of natural language processing. Current research focuses mainly on widely used languages and is rarely conducted for low-resource languages such as Indonesian. To the best of our knowledge, no evaluation dataset is available for the phrase extraction task in Indonesian. Phrase extraction is challenging for Indonesian because of the lack of language analysis tools and large datasets. We therefore propose a framework for constructing an Indonesian phrase extraction corpus: we use Wikipedia as a high-quality resource and match the phrases extracted from it against our POS-tagged corpus. Our linguistic experts manually classified the extracted POS patterns. With the annotated patterns, we re-extract phrases and construct a corpus of 8379 Indonesian phrases in total. In addition, we experiment with three deep learning models that achieve superior performance on phrase extraction, and we establish baselines for the Indonesian phrase extraction task.
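The pattern-based steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `collect_patterns` and `extract_phrases`, the tag labels (`NN`, `VB`, `IN`, `PRP`), and the two toy sentences are assumptions made for illustration, and the manual expert classification of patterns is replaced by simply approving every collected pattern.

```python
from collections import Counter


def collect_patterns(tagged_sentences, seed_phrases):
    """Count the POS-tag sequences under which seed phrases occur.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs.
    seed_phrases: phrases from a trusted source (hypothetically, Wikipedia).
    """
    seeds = {tuple(p.lower().split()) for p in seed_phrases}
    counts = Counter()
    for sent in tagged_sentences:
        tokens = [tok.lower() for tok, _ in sent]
        tags = [tag for _, tag in sent]
        for seed in seeds:
            n = len(seed)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == seed:
                    counts[tuple(tags[i:i + n])] += 1
    return counts


def extract_phrases(tagged_sentences, approved_patterns):
    """Return every lowercased n-gram whose tag sequence is an approved pattern."""
    phrases = set()
    for sent in tagged_sentences:
        tokens = [tok for tok, _ in sent]
        tags = [tag for _, tag in sent]
        for pattern in approved_patterns:
            n = len(pattern)
            for i in range(len(tags) - n + 1):
                if tuple(tags[i:i + n]) == pattern:
                    phrases.add(" ".join(tokens[i:i + n]).lower())
    return phrases


# Toy POS-tagged corpus; the tags are illustrative, not the paper's tag set.
corpus = [
    [("Dia", "PRP"), ("bekerja", "VB"), ("di", "IN"), ("rumah", "NN"), ("sakit", "NN")],
    [("Kereta", "NN"), ("api", "NN"), ("tiba", "VB"), ("di", "IN"), ("stasiun", "NN")],
]

# Step 1: match seed phrases against the tagged corpus to collect POS patterns.
patterns = collect_patterns(corpus, ["rumah sakit"])

# Step 2 (stand-in for manual classification): approve every collected pattern.
approved = set(patterns)

# Step 3: re-extract all phrases that follow an approved pattern.
phrases = extract_phrases(corpus, approved)
```

In this toy run the seed phrase "rumah sakit" yields the pattern `("NN", "NN")`, which then also picks up "kereta api" from the second sentence. A real pipeline would need the human filtering step, since a purely mechanical pattern match also admits tag sequences that are not genuine phrases.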

X. Lin and N. Lin are co-first authors; they worked together and contributed equally to the paper.

Notes

  1. https://id.wikipedia.org/

Acknowledgements

This work was supported by the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016), the National Natural Science Foundation of China (No. 61572145) and the National Social Science Foundation of China (No. 17CTQ045).

Copyright information

© 2022 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lin, X., Lin, N., Xiao, L., Jiang, S., Qiu, X. (2022). Towards Indonesian Phrase Extraction: Framework and Corpus. In: Liao, X., et al. Big Data. BigData 2021. Communications in Computer and Information Science, vol 1496. Springer, Singapore. https://doi.org/10.1007/978-981-16-9709-8_3

  • DOI: https://doi.org/10.1007/978-981-16-9709-8_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-9708-1

  • Online ISBN: 978-981-16-9709-8

  • eBook Packages: Computer Science, Computer Science (R0)
