DOI: 10.1145/3583780.3615003

PaperLM: A Pre-trained Model for Hierarchical Examination Paper Representation Learning

Published: 21 October 2023

Abstract

Representation learning of examination papers is crucial for online education systems, as it benefits applications such as paper difficulty estimation and examination paper retrieval. Previous work has mainly explored representation learning for individual questions within an examination paper, with limited attention paid to the examination paper as a whole. In fact, the structure of an examination paper is strongly correlated with paper properties such as difficulty, which existing paper representation methods fail to capture adequately. To this end, we propose a pre-trained model, PaperLM, to learn representations of examination papers. Our model integrates both the text content and the hierarchical structure of examination papers within a single framework by converting each path of the Examination Organization Tree (EOT) into an embedding. Furthermore, we design three pre-training objectives for PaperLM: EOT Node Relationship Prediction (ENRP), Question Type Prediction (QTP), and Paper Contrastive Learning (PCL), which together capture features from both text and structure. We pre-train our model on a real-world examination paper dataset, and then evaluate it on three downstream tasks: paper difficulty estimation, examination paper retrieval, and paper clustering. The experimental results demonstrate the effectiveness of our method.
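
To make the abstract's two central ideas concrete, here is a minimal, hypothetical PyTorch sketch: a depth-wise EOT path embedding (one lookup table per tree depth, summed into a single vector, in the spirit of the XPath embeddings used for markup documents) paired with an in-batch InfoNCE loss standing in for the Paper Contrastive Learning (PCL) objective. The abstract does not specify PaperLM's tag vocabulary, depth limit, hidden size, or the exact PCL formulation, so NUM_NODE_TYPES, MAX_DEPTH, HIDDEN, and the loss below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical EOT node-type vocabulary and sizes; the paper's actual
# values are not given in the abstract.
NUM_NODE_TYPES = 16   # e.g. paper / section / question / option
MAX_DEPTH = 8         # maximum root-to-node path length we embed
HIDDEN = 128

class EOTPathEmbedding(nn.Module):
    """Embed the root-to-node path of an EOT node: one lookup per depth,
    summed into a single fixed-size vector."""
    def __init__(self):
        super().__init__()
        self.depth_embeds = nn.ModuleList(
            nn.Embedding(NUM_NODE_TYPES + 1, HIDDEN, padding_idx=0)  # 0 = pad
            for _ in range(MAX_DEPTH)
        )

    def forward(self, paths: torch.Tensor) -> torch.Tensor:
        # paths: (batch, MAX_DEPTH) integer node-type ids, 0-padded
        return sum(emb(paths[:, d]) for d, emb in enumerate(self.depth_embeds))

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over two views of the same batch of paper representations:
    matching rows are positives, all other rows act as in-batch negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau              # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embed EOT paths for 4 nodes, then compute a contrastive loss
# between two (here randomly perturbed) views of the resulting vectors.
path_embed = EOTPathEmbedding()
paths = torch.randint(1, NUM_NODE_TYPES + 1, (4, MAX_DEPTH))
node_vecs = path_embed(paths)               # (4, HIDDEN)
view1, view2 = node_vecs, node_vecs + 0.01 * torch.randn_like(node_vecs)
print(info_nce(view1, view2))
```

Summing per-depth lookups keeps the path representation fixed-size no matter how deep a node sits in the tree, which is why markup-oriented pre-trained models favor this style of structural encoding.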

Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023, 5508 pages
ISBN: 9798400701245
DOI: 10.1145/3583780

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. examination paper representation
2. pre-trained language model
3. structured document analysis

Qualifiers

• Research-article

Funding Sources

• the National Key Research and Development Program of China
• the University Synergy Innovation Program of Anhui Province
• the Laboratory of Cognitive Intelligence
• the National Natural Science Foundation of China

Conference

CIKM '23

Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
