Fintech Key-Phrase: A New Chinese Financial High-Tech Dataset Accelerating Expression-Level Information Retrieval

Published: 22 November 2023

Abstract

Expression-level information extraction is a challenging task in natural language processing (NLP) that aims to retrieve crucial semantic information from linguistic documents. However, up-to-date data resources for expression-level information extraction are lacking, particularly in the Chinese financial high technology field. To address this gap, we introduce Fintech Key-Phrase: a human-annotated key-phrase dataset for the Chinese financial high technology domain. The dataset comprises over 12K paragraphs with annotated domain-specific key-phrases. We extract publicly released reports, Chinese management’s discussion and analysis (CMD&A), from the Chinese Research Data Services platform (CNRDS) and then filter for the reports related to high technology. The high technology key-phrases are annotated following pre-defined annotation guidelines to ensure annotation quality. To better understand the limitations and challenges of the proposed dataset, we conduct comprehensive noise evaluation experiments on Fintech Key-Phrase, including an annotation consistency assessment and an absolute annotation quality evaluation. To demonstrate the usefulness of Fintech Key-Phrase for retrieving valuable information in the Chinese financial high technology field, we evaluate it using several strong information retrieval systems as representative baselines and report the corresponding performance statistics. Additionally, we apply ChatGPT to text augmentation of the Fintech Key-Phrase dataset. Extensive comparative experiments demonstrate that the augmented dataset significantly improves the coverage and accuracy of key-phrase extraction in the finance and high-tech domains. We believe that this dataset can facilitate scientific research and exploration in the Chinese financial high technology field.
We have made the Fintech Key-Phrase dataset and the experimental code of the adopted baselines accessible on GitHub: https://github.com/albert-jin/Fintech-Key-Phrase. To encourage newcomers to take part in information retrieval research for the financial high-tech domain, we have developed a series of tools, including an open website and corresponding real-time information retrieval APIs.
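
The abstract mentions an annotation consistency assessment without naming the metric. As a minimal sketch, assuming agreement is measured with Cohen's kappa over token-level tags from two annotators (a common choice, not confirmed by the abstract), the computation could look as follows. The function name `cohen_kappa` and the BIO tag sequences are hypothetical and are not taken from the released code or dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical BIO tags from two annotators for the same eight tokens.
annotator_a = ["B", "I", "O", "O", "B", "I", "I", "O"]
annotator_b = ["B", "I", "O", "O", "B", "I", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most tokens in a paragraph fall outside any key-phrase and would inflate a plain agreement rate.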

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 11
November 2023
255 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3633309
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 November 2023
Online AM: 01 November 2023
Accepted: 11 October 2023
Revised: 13 June 2023
Received: 28 December 2022
Published in TALLIP Volume 22, Issue 11

Author Tags

  1. Information retrieval
  2. expression-level information extraction
  3. financial high technology field
  4. Chinese management’s discussion and analysis
  5. ChatGPT-based data augment

Qualifiers

  • Research-article

Funding Sources

  • Shaanxi Key Laboratory of Intelligent Processing for Big Energy Data
  • National Science Fund for Distinguished Young Scholars
  • Major Research Plan of National Natural Science Foundation of China

Cited By

  • (2024) Frequency-Oriented Transformer for Remote Sensing Image Dehazing. Sensors 24:12 (3972). DOI: 10.3390/s24123972. Online publication date: 19-Jun-2024
  • (2024) Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language. Computers 13:2 (39). DOI: 10.3390/computers13020039. Online publication date: 29-Jan-2024
  • (2024) Enhancing aspect-based sentiment analysis with BERT-driven context generation and quality filtering. Natural Language Processing Journal 7 (100077). DOI: 10.1016/j.nlp.2024.100077. Online publication date: Jun-2024
  • (2024) Multimodal Aspect-Based Sentiment Analysis: A survey of tasks, methods, challenges and future directions. Information Fusion 112 (102552). DOI: 10.1016/j.inffus.2024.102552. Online publication date: Dec-2024