Abstract
Phrase mining aims to automatically extract high-quality phrases from a given corpus, an essential step in transforming unstructured text into structured information. Existing statistics-based methods achieve state-of-the-art performance on this task. However, they rely heavily on statistical signals to extract quality phrases and ignore contextual information.
In this paper, we propose ConPhrase, a novel context-aware method for automated phrase mining that formulates phrase mining as a sequence labeling problem and takes contextual information into account. To tackle the scarcity of global information and the filtering of noisy data, ConPhrase comprises two modules: 1) a topic-aware phrase recognition network that incorporates domain-related topic information into word representation learning to identify quality phrases effectively; and 2) an instance selection network that uses reinforcement learning to choose correct sentences, further improving the prediction performance of the phrase recognition network. Experimental results demonstrate that ConPhrase outperforms the state-of-the-art approach.
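To make the sequence labeling formulation concrete, the following minimal sketch shows how per-token labels can be turned back into phrases. The B/I/O tag scheme, the function name `decode_phrases`, and the example sentence are assumptions for illustration only; the abstract does not specify the tagging scheme ConPhrase uses, and the tags here are written by hand rather than predicted by any model.

```python
from typing import List, Tuple


def decode_phrases(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, text) spans from B/I/O tags (assumed scheme)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the currently open phrase began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new phrase begins; close the one that is still open.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I":
            # Tolerate an "I" without a preceding "B" by opening a phrase here.
            if start is None:
                start = i
        else:
            # Any other tag (e.g. "O") closes the open phrase.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hand-written example: tags are illustrative, not model output.
tokens = ["We", "study", "phrase", "mining", "with", "reinforcement", "learning", "."]
tags = ["O", "O", "B", "I", "O", "B", "I", "O"]
phrases = decode_phrases(tokens, tags)
```

Under this framing, a recognition network only has to predict one tag per token, and phrase boundaries follow from the tag sequence.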
Acknowledgments
This work is supported by the National Key R&D Program of China (No. 2018YFB1004401) and by NSFC under grants No. 61772537, 61772536, 61702522, 61532021, and 62072460.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, X., Li, Q., Li, C., Chen, H. (2021). Automated Context-Aware Phrase Mining from Text Corpora. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_2
DOI: https://doi.org/10.1007/978-3-030-73197-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73196-0
Online ISBN: 978-3-030-73197-7
eBook Packages: Computer Science (R0)