Automated Context-Aware Phrase Mining from Text Corpora

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12682)

Abstract

Phrase mining aims to automatically extract high-quality phrases from a given corpus, an essential step in transforming unstructured text into structured information. Existing statistics-based methods achieve state-of-the-art performance on this task. However, such methods rely heavily on statistical signals to extract quality phrases and ignore the effect of contextual information.
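
As a rough illustration of what a statistical signal can look like, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). PMI is an assumed, commonly used example; the abstract does not name the signals that existing methods use, and the function name `bigram_pmi` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return a PMI score for every adjacent word pair in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # probability of the pair
        p_a = unigrams[a] / n_uni    # probability of the first word
        p_b = unigrams[b] / n_uni    # probability of the second word
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Toy corpus (hypothetical), already tokenized.
corpus = [
    ["support", "vector", "machine", "is", "a", "model"],
    ["a", "support", "vector", "machine", "is", "trained"],
    ["the", "model", "is", "a", "machine"],
]
scores = bigram_pmi(corpus)
print(scores[("support", "vector")] > scores[("a", "machine")])
```

A score like this depends only on corpus-wide counts. It is the same for every occurrence of a word pair, whatever sentence the pair appears in, which is the limitation the abstract points to.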

In this paper, we propose ConPhrase, a novel context-aware method for automated phrase mining that formulates phrase mining as a sequence labeling problem and takes contextual information into account. To tackle the scarcity of global information and the need to filter noisy data, ConPhrase comprises two modules: 1) a topic-aware phrase recognition network that incorporates domain-related topic information into word representation learning to identify quality phrases effectively; and 2) an instance selection network that uses reinforcement learning to choose correct sentences, further improving the prediction performance of the phrase recognition network. Experimental results demonstrate that ConPhrase outperforms the state-of-the-art approach.
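
To make the sequence labeling formulation concrete, the following minimal sketch turns a tokenized sentence into per-token labels. It assumes a B/I/O tagging scheme and labels produced by longest-match lookup in a phrase list; the abstract specifies neither, and the function `bio_tags` and the example phrases are hypothetical, not part of ConPhrase.

```python
def bio_tags(tokens, phrases):
    """Label tokens with B/I/O tags by longest-match lookup of known phrases."""
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=0)
    lowered = [t.lower() for t in tokens]

    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so nested phrases do not win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in phrase_set:
                matched = n
                break
        if matched:
            tags[i] = "B"                      # first token of a phrase
            for j in range(i + 1, i + matched):
                tags[j] = "I"                  # remaining tokens of the phrase
            i += matched
        else:
            i += 1                             # token stays outside any phrase
    return tags


tokens = "We train a support vector machine on noisy data".split()
tags = bio_tags(tokens, ["support vector machine", "noisy data"])
print(list(zip(tokens, tags)))
```

A labeling model would then predict one tag per token from the sentence context. Labels generated automatically in this way can be wrong for some sentences, which is one plausible source of the noisy data that a sentence selection step would need to filter.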

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2018YFB1004401) and by NSFC under grants No. 61772537, 61772536, 61702522, 61532021, and 62072460.

Author information

Corresponding author

Correspondence to Cuiping Li.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, X., Li, Q., Li, C., Chen, H. (2021). Automated Context-Aware Phrase Mining from Text Corpora. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_2

  • DOI: https://doi.org/10.1007/978-3-030-73197-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-73196-0

  • Online ISBN: 978-3-030-73197-7

  • eBook Packages: Computer Science, Computer Science (R0)
