ABSTRACT
The lack of a large-scale information operation and maintenance corpus greatly limits the development of information technology operation and maintenance management, especially for languages other than English. To improve this situation, we introduce a large-scale Chinese information operation and maintenance knowledge retrieval corpus and release it publicly. A key challenge in building such a corpus is collecting a large amount of retrieval data in different languages. We first use search engines to collect a large body of Chinese information operation and maintenance knowledge text related to high-frequency words in various fields, and then use ChatGPT (https://chat.openai.com/) to generate relevant questions for the collected text. Finally, we recruit three annotators to manually check the quality of the retrieval corpus. The resulting Chinese information operation and maintenance knowledge corpus contains 2000 retrieval questions. To verify its quality, we divide the corpus into a training set of 1500 retrieval questions and a test set of 500 retrieval questions, and evaluate several well-known retrieval methods on them (https://pan.baidu.com/s/1rLWqHZJhE9nEOYg3OTC1Ag). The experimental results not only demonstrate the high quality of the corpus but also provide a solid performance baseline for further research on it.
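The 1500/500 division described above can be illustrated with a minimal sketch. The abstract does not say how the split was drawn, so the seeded random shuffle, the function name `split_corpus`, and the placeholder question IDs below are all assumptions made for illustration, not the authors' procedure.

```python
import random


def split_corpus(questions, train_size=1500, test_size=500, seed=0):
    """Split retrieval questions into disjoint training and test sets.

    Assumption: a seeded random shuffle; the source only gives the set sizes.
    """
    if train_size + test_size != len(questions):
        raise ValueError("train_size + test_size must equal the corpus size")
    shuffled = list(questions)             # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical placeholder IDs standing in for the 2000 retrieval questions.
questions = [f"question-{i}" for i in range(2000)]
train, test = split_corpus(questions)
print(len(train), len(test))
```

Fixing the seed keeps the two sets stable across runs, which matters when baseline results from different retrieval methods are compared on the same test set.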
Index Terms
- A Chinese Information Operation and Maintenance Knowledge Retrieval Corpus