Rule-Based HierarchicalRank: An Unsupervised Approach to Visible Tag Extraction from Semi-structured Chinese Text

Lei, Jicheng; Yu, Jiali; He, Chunhui; Zhang, Chong; Ge, Bin; Bao, Yiping

doi:10.1007/978-3-030-29894-4_15

Jicheng Lei, Jiali Yu, Chunhui He (ORCID: orcid.org/0000-0003-1505-1620), Chong Zhang, Bin Ge & Yiping Bao

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11672)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

The large and growing volume of semi-structured Chinese text presents both challenges and opportunities for text mining and knowledge discovery. One such challenge is to automatically extract from a document a small set of visible tags that accurately reveal the document's topic and facilitate fast information processing. At this stage, however, a gap remains between existing methods and real engineering applications.

To narrow this gap, we propose Rule-Based HierarchicalRank (RBH), an unsupervised method that extracts visible tags from semi-structured Chinese text at two levels: the document's title and its non-title content. A different method is used at each level to extract the candidate visible tags. Experimental results on two distinct datasets show that RBH performs far better than all the baseline methods on the visible tag extraction task. Specifically, on Paper-Dataset, the precision and F1-score of RBH reach 18.6% and 14.1%, respectively, at top K = 5. On Event-Dataset, the best precision of our method is 7% higher than that of the state-of-the-art method PositionRank at top K = 1. Furthermore, the best recall of RBH reaches 37.7% at top K = 5.
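The figures above are precision, recall and F1-score measured at a cutoff of top K extracted tags. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of how such scores are commonly computed for a single document, assuming exact string matching against a gold tag set and a precision denominator equal to the number of tags actually returned. The function name `prf_at_k` and the sample tags are hypothetical and are not taken from the paper or its datasets.

```python
from typing import Sequence, Set, Tuple


def prf_at_k(ranked_tags: Sequence[str], gold_tags: Set[str], k: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the top-k tags of one document.

    Assumption: a predicted tag counts as correct only on exact match
    with a gold tag; the paper may use a different matching rule.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Drop duplicate predictions while keeping rank order, then cut at k.
    top_k = list(dict.fromkeys(ranked_tags))[:k]
    hits = sum(1 for tag in top_k if tag in gold_tags)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_tags) if gold_tags else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only, not drawn from Paper-Dataset or Event-Dataset.
    ranked = ["keyphrase extraction", "Chinese text", "ranking", "tag", "dataset"]
    gold = {"keyphrase extraction", "tag", "unsupervised"}
    for k in (1, 5):
        p, r, f = prf_at_k(ranked, gold, k)
        print(f"K={k}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Corpus-level scores such as those reported in the abstract would typically be obtained by averaging these per-document values, although whether the authors use macro or micro averaging is not stated here.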



Acknowledgment

This paper was supported by the National Natural Science Foundation of China (NSFC) via grant No. 61872446 and by the Natural Science Foundation of Hunan Province, China via grants No. 2018JJ2475 and No. 2018JJ2476.

Author information

Authors and Affiliations

CETC Big Data Research Institute Co., Ltd., Chengdu, China
Jicheng Lei
Tus-Holdings Co., Ltd., Beijing, China
Jiali Yu
Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, Hunan, People’s Republic of China
Chunhui He, Chong Zhang & Bin Ge
Guizhou Wingscloud Co. Ltd., Guiyang, China
Yiping Bao


Corresponding author

Correspondence to Chunhui He.

Editor information

Editors and Affiliations

Department of Computing, Macquarie University, Sydney, NSW, Australia
Abhaya C. Nayak
RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Alok Sharma


About this paper

Cite this paper

Lei, J., Yu, J., He, C., Zhang, C., Ge, B., Bao, Y. (2019). Rule-Based HierarchicalRank: An Unsupervised Approach to Visible Tag Extraction from Semi-structured Chinese Text. In: Nayak, A., Sharma, A. (eds) PRICAI 2019: Trends in Artificial Intelligence. PRICAI 2019. Lecture Notes in Computer Science, vol 11672. Springer, Cham. https://doi.org/10.1007/978-3-030-29894-4_15


DOI: https://doi.org/10.1007/978-3-030-29894-4_15
Published: 23 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29893-7
Online ISBN: 978-3-030-29894-4
eBook Packages: Computer Science, Computer Science (R0)
