DOI: 10.1145/3583780.3615003

PaperLM: A Pre-trained Model for Hierarchical Examination Paper Representation Learning

Published: 21 October 2023

Abstract

Representation learning of examination papers is crucial for online education systems, as it benefits applications such as paper difficulty estimation and examination paper retrieval. Previous work has mainly explored representation learning for individual questions within an examination paper, with limited attention paid to the examination paper as a whole. In fact, the structure of an examination paper is strongly correlated with paper properties such as difficulty, which existing paper representation methods fail to capture adequately. To this end, we propose a pre-trained model, PaperLM, to learn representations of examination papers. Our model integrates both the text content and the hierarchical structure of examination papers within a single framework by converting each path of the Examination Organization Tree (EOT) into an embedding. Furthermore, we design three pre-training objectives for PaperLM: EOT Node Relationship Prediction (ENRP), Question Type Prediction (QTP), and Paper Contrastive Learning (PCL), which together capture features from both text and structure. We pre-train our model on a real-world examination paper dataset, and then evaluate it on three downstream tasks: paper difficulty estimation, examination paper retrieval, and paper clustering. The experimental results demonstrate the effectiveness of our method.
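
To make the abstract's two central ideas concrete, here is a minimal, hypothetical PyTorch sketch: a depth-wise EOT path embedding (one lookup table per tree depth, summed into a single vector, in the spirit of the XPath embeddings used for markup documents) paired with an in-batch InfoNCE loss standing in for the Paper Contrastive Learning (PCL) objective. The abstract does not specify PaperLM's tag vocabulary, depth limit, hidden size, or the exact PCL formulation, so NUM_NODE_TYPES, MAX_DEPTH, HIDDEN, and the loss below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical EOT node-type vocabulary and sizes; the paper's actual
# values are not given in the abstract.
NUM_NODE_TYPES = 16   # e.g. paper / section / question / option
MAX_DEPTH = 8         # maximum root-to-node path length we embed
HIDDEN = 128

class EOTPathEmbedding(nn.Module):
    """Embed the root-to-node path of an EOT node: one lookup per depth,
    summed into a single fixed-size vector."""
    def __init__(self):
        super().__init__()
        self.depth_embeds = nn.ModuleList(
            nn.Embedding(NUM_NODE_TYPES + 1, HIDDEN, padding_idx=0)  # 0 = pad
            for _ in range(MAX_DEPTH)
        )

    def forward(self, paths: torch.Tensor) -> torch.Tensor:
        # paths: (batch, MAX_DEPTH) integer node-type ids, 0-padded
        return sum(emb(paths[:, d]) for d, emb in enumerate(self.depth_embeds))

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over two views of the same batch of paper representations:
    matching rows are positives, all other rows act as in-batch negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau              # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embed EOT paths for 4 nodes, then compute a contrastive loss
# between two (here randomly perturbed) views of the resulting vectors.
path_embed = EOTPathEmbedding()
paths = torch.randint(1, NUM_NODE_TYPES + 1, (4, MAX_DEPTH))
node_vecs = path_embed(paths)               # (4, HIDDEN)
view1, view2 = node_vecs, node_vecs + 0.01 * torch.randn_like(node_vecs)
print(info_nce(view1, view2))
```

Summing per-depth lookups keeps the path representation fixed-size no matter how deep a node sits in the tree, which is why markup-oriented pre-trained models favor this style of structural encoding.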

Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023, 5508 pages
ISBN: 9798400701245
DOI: 10.1145/3583780

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. examination paper representation
2. pre-trained language model
3. structured document analysis

Qualifiers

• Research-article

Funding Sources

• the National Key Research and Development Program of China
• the University Synergy Innovation Program of Anhui Province
• the Laboratory of Cognitive Intelligence
• the National Natural Science Foundation of China

Conference

CIKM '23

Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
