
Task-Related Pretraining with Whole Word Masking for Chinese Coherence Evaluation

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14304)


Abstract

This paper presents an approach for evaluating coherence in Chinese middle school student essays, addressing the time-consuming and inconsistent nature of manual essay assessment. Previous approaches focused on linguistic features, while coherence, which is crucial for essay organization, has received less attention. Recent work has applied neural networks such as CNNs, LSTMs, and transformers, achieving good performance when labeled data are available. However, labeling coherence manually is costly and time-consuming. To address this, we propose a method that pretrains RoBERTa with whole word masking (WWM) on a low-resource dataset of middle school essays, followed by finetuning for coherence evaluation. The WWM pretraining is unsupervised and captures general characteristics of the essays, adding little cost in the low-resource setting. Experimental results on Chinese essays demonstrate that this strategy improves coherence evaluation compared to naive finetuning on limited data. We also explore variants of our method, including pseudo labeling and additional neural networks, providing insights into potential performance trade-offs. The contributions of this work include the collection and curation of a substantial essay dataset, a cost-effective pretraining method, and the exploration of alternative approaches for future research.
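The pretraining step described above is, in effect, continued masked-language-model training on the unlabeled essay corpus before finetuning on the labeled coherence data. Below is a minimal sketch of how such task-related WWM pretraining could be set up with the HuggingFace transformers library; the checkpoint name, corpus file, and hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of task-related continued pretraining with whole word masking (WWM).
# Assumes the HuggingFace `transformers`/`datasets` libraries and an unlabeled corpus
# of essays in `essays.txt` (one essay per line); names and settings are illustrative.
from transformers import (
    BertTokenizerFast, BertForMaskedLM,
    DataCollatorForWholeWordMask, Trainer, TrainingArguments,
)
from datasets import load_dataset

model_name = "hfl/chinese-roberta-wwm-ext"   # a public Chinese RoBERTa-WWM checkpoint
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Unsupervised in-domain corpus: middle school essays, no coherence labels needed.
dataset = load_dataset("text", data_files={"train": "essays.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# The collator masks all sub-tokens of a word together; for Chinese, word boundaries
# would normally be supplied via `chinese_ref` from a word segmenter (e.g. LTP),
# which is omitted here for brevity.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wwm-pretrained", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
# The resulting encoder is then finetuned on the labeled coherence-evaluation data.
```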


Notes

  1. https://github.com/cubenlp/NLPCC-2023-Shared-Task7.

  2. http://www.leleketang.com/zuowen/.


Acknowledgement

This work is supported by the National Natural Science Foundation of China (62076008) and the Key Project of Natural Science Foundation of China (61936012).

Author information

Correspondence to Yunfang Wu.

Appendix

While improving overall accuracy, we also experimented with some new model architectures, including the Cross Task Grader model described below.

1.1 PFT+HAN

We proposed a multi-layer coherence evaluation model, depicted in Fig. 1, which first used the pre-trained RoBERTa to extract features from the essays, followed by an attention pooling layer. Then, we concatenated punctuation-level embeddings and passed them through another attention pooling layer. Finally, we obtained the final coherence score with a classifier.

Fig. 1. PFT+HAN

Fig. 2. Cross Task Grader

Pre-trained Encoder. A sequence of words \(s_i=\{w_1,w_2,\ldots ,w_m\}\) is encoded with the pre-trained RoBERTa, producing a contextual representation \(x_i\) for each word.

Paragraph Representation Layer. An attention pooling layer, applied to the output of the pre-trained encoder, captures the paragraph representation and is defined as follows:

$$\begin{aligned} m_{i}=\tanh ({W_m}\cdot {x_i}+{b_m}) \end{aligned}$$
(2)
$$\begin{aligned} u_i=\frac{e^{{w_u}\cdot {m_i}}}{\sum \limits _{j=1}^{m} e^{{w_u}\cdot {m_j}}} \end{aligned}$$
(3)
$$\begin{aligned} p=\sum \limits _{i=1}^{m} {u_i}\cdot {x_i} \end{aligned}$$
(4)

where \(W_m\) is a weight matrix, \(w_u\) is a weight vector, \(m_i\) is the attention vector for the i-th word, \(u_i\) is the attention weight for the i-th word, and p is the paragraph representation.
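As a reference, here is a minimal PyTorch sketch of this attention pooling layer implementing Eqs. (2)-(4); the hidden dimension is an assumption.

```python
# Attention pooling over token representations, following Eqs. (2)-(4).
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.W_m = nn.Linear(hidden_dim, hidden_dim)      # W_m, b_m in Eq. (2)
        self.w_u = nn.Linear(hidden_dim, 1, bias=False)   # w_u in Eq. (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim) token representations from the encoder
        m = torch.tanh(self.W_m(x))             # Eq. (2): attention vectors m_i
        u = torch.softmax(self.w_u(m), dim=1)   # Eq. (3): attention weights u_i
        p = (u * x).sum(dim=1)                  # Eq. (4): weighted sum -> paragraph rep p
        return p

# Example: pooling a batch of 2 paragraphs of 128 tokens each -> (2, 768)
p = AttentionPooling()(torch.randn(2, 128, 768))
```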

Essay Representation Layer. We incorporated punctuation representations to enhance the model’s performance. For each paragraph, we encoded the punctuation information to obtain a punctuation representation \(pu_i\), and then concatenated it with the content representation \(p_i\) of that paragraph:

$$\begin{aligned} c_i=concatenate(p_i,pu_i) \end{aligned}$$
(5)

where \(c_i\) is the concatenated representation of the i-th paragraph. Next, we use another attention pooling layer to obtain the representation of the entire essay, defined as follows:

$$\begin{aligned} a_{i}=\tanh ({W_a}\cdot {c_i}+{b_a}) \end{aligned}$$
(6)
$$\begin{aligned} v_i=\frac{e^{{w_v}\cdot {a_i}}}{\sum \limits _{j=1}^{n} e^{{w_v}\cdot {a_j}}} \end{aligned}$$
(7)
$$\begin{aligned} E=\sum \limits _{i=1}^{n} {v_i}\cdot {c_i} \end{aligned}$$
(8)

where \(W_a\) is a weight matrix, \(w_v\) is a weight vector, \(a_i\) is the attention vector for the i-th paragraph, \(v_i\) is the attention weight for the i-th paragraph, n is the number of paragraphs, and E is the essay representation.
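For concreteness, the following is a hypothetical PyTorch sketch of the PFT+HAN pipeline described above, reusing the AttentionPooling module from the previous snippet; the punctuation-embedding dimension, number of coherence classes, and checkpoint name are assumptions, not the authors' exact configuration.

```python
# Hypothetical PFT+HAN sketch: encoder -> word-level pooling -> punctuation concat
# -> paragraph-level pooling -> classifier. AttentionPooling is the module from
# the previous sketch.
import torch
import torch.nn as nn
from transformers import BertModel

class PFTHAN(nn.Module):
    def __init__(self, encoder_name: str = "hfl/chinese-roberta-wwm-ext",
                 hidden_dim: int = 768, punct_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)     # pre-trained RoBERTa
        self.word_pool = AttentionPooling(hidden_dim)               # Eqs. (2)-(4)
        self.para_pool = AttentionPooling(hidden_dim + punct_dim)   # Eqs. (6)-(8)
        self.classifier = nn.Linear(hidden_dim + punct_dim, num_classes)

    def forward(self, input_ids, attention_mask, punct_emb):
        # input_ids, attention_mask: (num_paragraphs, seq_len)
        # punct_emb: (num_paragraphs, punct_dim) punctuation representations pu_i
        x = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        p = self.word_pool(x)                   # paragraph representations p_i
        c = torch.cat([p, punct_emb], dim=-1)   # Eq. (5): concatenate punctuation info
        essay = self.para_pool(c.unsqueeze(0))  # Eqs. (6)-(8): essay representation E
        return self.classifier(essay)           # coherence logits
```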

1.2 Cross Task Grader

We also used multi-task learning (MTL) in our experiments, as depicted in Fig. 2.

We used both the target data and some pseudo-labeled essays from various grades, and created a separate PFT+HAN model for each. To facilitate multi-task learning, we adopted the hard parameter sharing approach, sharing the pre-trained encoder layer and the first attention pooling layer among all the models. Additionally, we added a cross attention layer before the classifier.

Cross Attention Layer. After obtaining the essay representation, we added a cross attention layer to learn the connections between different essays, defined as follows:

$$\begin{aligned} A=[E_1,E_2,\ldots ,E_N] \end{aligned}$$
(9)
$$\begin{aligned} \alpha ^{i}_{j}=\frac{e^{score(E_i,A_{i,j})}}{\sum \limits _{l=1}^{N} e^{score(E_i,A_{i,l})}} \end{aligned}$$
(10)
$$\begin{aligned} P_i=\sum \limits _{j=1}^{N} {\alpha ^{i}_{j}}\cdot {A_{i,j}} \end{aligned}$$
(11)
$$\begin{aligned} y_i=concatenate(E_i,P_i) \end{aligned}$$
(12)

where A is the concatenation of the representations for each task, \([E_1,E_2,\ldots ,E_N]\), and \(\alpha ^{i}_{j}\) is the attention weight. We then compute the attention vector \(P_i\) as the sum of each weight \(\alpha ^{i}_{j}\) multiplied by \(A_{i,j}\). The final representation \(y_i\) is the concatenation of \(E_i\) and \(P_i\).
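A hypothetical PyTorch sketch of this cross attention layer, implementing Eqs. (9)-(12), is given below; the scaled dot-product score function is an assumption, since the score function is not specified here.

```python
# Cross attention over per-task essay representations, following Eqs. (9)-(12).
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (N, d) essay representations E_1..E_N, one per task   -> Eq. (9)
        d = E.size(-1)
        scores = E @ E.t() / d ** 0.5           # assumed score(E_i, A_{i,j}) for all pairs
        alpha = torch.softmax(scores, dim=-1)   # Eq. (10): attention weights alpha^i_j
        P = alpha @ E                           # Eq. (11): P_i = sum_j alpha^i_j * A_{i,j}
        return torch.cat([E, P], dim=-1)        # Eq. (12): y_i = concat(E_i, P_i)

# Example: 4 tasks with 832-dimensional essay representations -> (4, 1664)
y = CrossTaskAttention()(torch.randn(4, 832))
```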


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Wang, Z., Lee, S., Cai, Y., Wu, Y. (2023). Task-Related Pretraining with Whole Word Masking for Chinese Coherence Evaluation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science(), vol 14304. Springer, Cham. https://doi.org/10.1007/978-3-031-44699-3_28


  • DOI: https://doi.org/10.1007/978-3-031-44699-3_28


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44698-6

  • Online ISBN: 978-3-031-44699-3
