Automatic paper writing based on a RNN and the TextRank algorithm

https://doi.org/10.1016/j.asoc.2020.106767

Highlights

  • The proposed framework can generate a comprehensive introduction based on users’ brief inputs.

  • This paper demonstrates a three-testing-threshold approach to check the content quality.

  • The introduction generated through this method has less than 3% similarity to the original text by comparison.

Abstract

Academic research is crucial to the development of science and technology and is an important factor in national strength. Academic papers typically follow a rhetorical structure to present their ideas, but mastering that structure is quite difficult for junior researchers. To solve this problem, some studies have adopted text mining to assist with the writing, but the existing methods still require human intervention to generate sentences. Recently, as deep learning technology has matured, progress has been made on the problem of automatic text generation: deep learning models can generate sequences and find correlations between sequences. When a user provides a few keywords and key sentences, the proposed algorithm can generate an introduction section for the user. The results show that the generated introduction is more coherent, clearer, and more fluent than the output of existing summarization methods. In addition, the method proposed in this study improves accuracy compared with traditional text extraction methods. An evaluation of the generated manuscript shows that the method can produce a comprehensive introduction compared with those of previous studies.

Introduction

Academic research in various fields not only helps to develop innovative technologies in various industries but also promotes industrial and national economic progress [1], [2], [3]. One fundamental research task is to develop a high level of academic writing [4]; however, academic writing can be a challenge for junior researchers [5], [6], [7], and it is especially problematic for those for whom English is a second language (ESL) or a foreign language (EFL) [8].

The problems that junior researchers face in academic writing, compared with native speakers, have been well discussed [9]. Some of the difficulties are sentence-level problems with grammar and vocabulary. However, a distinctive feature of academic writing is the importance of discourse organization, where junior researchers often encounter difficulties with text structure and style. Given the stylistic requirements of each section, these difficulties can be attributed to the fact that the students are not native authors and are unfamiliar with academic norms. Writing an article from scratch is very difficult for these individuals.

It would help if a tool could be designed to assist these ESL scholars. Several grammatical error correction (GEC) technologies have been developed to help language learners [10]. However, these assisted systems only help to improve grammar by finding mistakes and typos; they barely support users in generating the article context. Several studies have proposed systems that can give users grammatical hints that assist in writing sentences [11], [12]. In [13], a writing assistant system was developed that provides dictionary content based on the user’s needs. This capability allows the user to focus on writing the article without having to consult external sources of information. However, it was found that those systems can only give short hints or phrase corrections and fail to achieve sentence prediction and deduction. In summary, the existing tools, which are still dominated by corrections and short prompts, are insufficient at helping ESL researchers to write academic paragraphs or articles, and junior researchers still find it difficult to write an article from scratch even with these tools. The author of [14] revealed that one of the main problems that junior researchers face is difficulty when starting to write a first draft; another problem is unfamiliarity with or inadequate use of the academic rhetorical genre. To help these researchers write well-organized paragraphs and articles that fit the typical academic structure, we must develop a sentence generation method that allows users to modify and learn the form of an academic article based on a generated article. Therefore, automatic research content generation is an appealing research topic.

The process of automatic text generation starts by analyzing large amounts of text data and then summarizing the content desired by the user to produce a readable article. Generally, text summarization can be divided into extractive and abstractive summarization. Extractive summarization mainly selects the important words, phrases and sentences from the original text and recombines them into a new text according to a specific summary ratio [15]. The existing text generation research has been applied to the field of academic paper abstract generation [16]; however, the generated text typically has a narrative or concept that is quite different from that of the original text, and the statements are not fluent. Alternatively, abstractive summarization rewrites the content to produce an abstract that represents the original file, where the vocabulary combinations do not stem entirely from the original file. This summary method produces results that are relatively close to the forms in which people write. However, rewriting abstracts requires techniques such as information extraction, discourse understanding, and natural language generation. Therefore, prior research has focused on extractive automatic summarization. In recent years, abstractive generation methods have received much attention, and deep learning methods, especially Seq2Seq, perform well and are widely used in various fields [17], [18].
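To make the idea of a summary ratio concrete, the following is a minimal sketch of extractive summarization. It is an illustration only, not the method of this paper or of [15]: the sentence splitter, the average-word-frequency score, and the names `split_sentences` and `extractive_summary` are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, ratio: float = 0.3) -> list[str]:
    """Keep roughly `ratio` of the sentences, ranked by average word frequency.

    The scoring rule is a hypothetical stand-in for any importance measure.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # The summary ratio fixes how many sentences survive (at least one).
    keep = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Selected sentences are copied verbatim and returned in document order.
    return [sentences[i] for i in sorted(top)]
```

Because every output sentence is copied unchanged from the input, the result cannot contain wording that is absent from the source, which is the defining limitation of the extractive approach.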

In the deep learning field, automatic summarization has evolved from an extraction-based approach to a generation-based approach. The recurrent neural network (RNN) [19] has become the most widely used model for automatically generated summaries, achieving significant results on both news headline generation and story generation tasks. The attention mechanism has been widely used as a key component in RNN-based sequence-to-sequence learning frameworks [20] because it relieves the bottleneck of the basic RNN encoder-decoder, which must compress an input sequence of any length into a single fixed-size vector [21], [22], [23].
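The role of attention can be sketched in a few lines. The example below uses plain dot-product scoring, which is a simplifying assumption and not necessarily the variant used in [20] or in this study; the function names and the toy vectors are hypothetical. It shows that the decoder receives a weighted sum over all encoder states, so the context vector keeps a fixed size regardless of input length.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    decoder_state: list[float], encoder_states: list[list[float]]
) -> tuple[list[float], list[float]]:
    """Return (attention weights, context vector) for one decoding step.

    Scores are dot products between the decoder state and each encoder state
    (an assumed scoring function chosen for brevity).
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context = weighted sum of encoder states, one value per hidden dimension.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context
```

Calling `attention_context([1.0, 0.0], states)` with three or with thirty encoder states returns a context vector of the same length, while the weights indicate which input positions the decoder is attending to at that step.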

However, when using text generation technology as a method for generating academic articles, several problems must still be resolved. The first problem is whether the generated sentences contain important information [24]. The second problem is whether the generated sentences are related to the user’s research field, and the third is whether the semantics of the sentences are closely related. According to previous studies, academic papers usually have one of two major types of rhetorical organization: the Introduction, Methods, Results, and Discussion (IMRD) structure and the Creating a Research Space (CARS) structure [25]. In contrast to IMRD, Martín and León Pérez [26] mentioned the implied CARS model, which summarizes the rhetorical structure of a paper’s introduction, and that structure has become the introductory rhetorical strategy adopted by many scholars. Furthermore, Anthony and Lashkia [27] found that it was not easy to fulfill the CARS model because CARS divides the introduction into too many parts. Overall, it is still difficult to generate articles in an academic format.

To solve the problems mentioned above, this study combines the advantages of both abstractive and extractive summarization methods by designing a text generation method based on TextRank, a deep learning method (Seq2Seq) and paper structure (‘moves’, defined below) to generate an introduction section for the users. The users input keywords and a few key sentences based on their needs to reduce the physical effort and writing time required to meet academic paper format specifications.

We propose a paper creation structure named Create A Research Space Generation (CARS-GEN) to address this problem. CARS-GEN consists of three major moves and five core steps. Initially, articles are collected based on user-entered keywords. Next, the collected articles are divided into different moves and steps. TextRank is then used to extract the keyword sentences from the moves, and the keyword sentences are input into an RNN to generate the entire article. Finally, a three-testing-threshold approach (importance test, thematic test, and coherence test) is applied to determine whether the generated results meet the requirements: whether the generated sentences conform to the research theme desired by the user, whether the content is important, and whether the sentences are coherent. This approach improves the quality of the generated introduction text as the algorithm learns the process of automatic text generation. The experimental results show that the generated articles are informative, fluent, and conform to academic article formats.
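The TextRank step can be illustrated with a short, self-contained sketch. It follows the general TextRank idea (a PageRank-style iteration over a sentence similarity graph, with word overlap normalized by the logarithms of the sentence lengths), but the tokenizer, the damping factor of 0.85, the fixed iteration count, and all function names are assumptions made here; the exact configuration used in CARS-GEN is not given in this excerpt.

```python
import math
import re


def tokenize(sentence: str) -> set[str]:
    """Lowercased set of alphabetic tokens (a simplifying assumption)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a: set[str], b: set[str]) -> float:
    """Word overlap normalized by the log of each sentence length."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # guards against log(0) and a zero denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences: list[str], damping: float = 0.85, iterations: int = 50) -> list[float]:
    """Score sentences with a weighted PageRank iteration over the similarity graph."""
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(sentences: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

A sentence that shares no words with the others receives only the baseline score of `1 - damping`, so it is ranked last; sentences that overlap with many others accumulate score from their neighbors and are the ones extracted as key sentences.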

Section snippets

Academic article writing and automatic generation

According to Kwan [25], academic papers usually have a common structure called Introduction–Method–Result–Discussion (IMRD). The introduction is the first and most important part of the article, and it contains an overall representation of the full text and structure. A well-written introduction is important because it helps to capture attention and convince people to click through. The study also found the Create a Research Space (CARS) structure, which many people follow when writing the

Methodology

The framework of this study is shown in Fig. 1, including the data preprocessing module, the text summarization module, the sequence learning and sentence generation module, and the text filtering module.
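As an illustration of the text filtering module, the sketch below applies three thresholds (importance, thematic, coherence) to candidate sentences. The scoring functions are hypothetical stand-ins based on bag-of-words cosine similarity, and the threshold values are arbitrary; the measures and thresholds actually used in this study are not specified in this excerpt.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def bag(text: str) -> Counter:
    """Bag-of-words vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs; the paper's values are not given here."""
    importance: float = 0.2
    thematic: float = 0.2
    coherence: float = 0.1


def passes_three_tests(sentence, previous, key_sentences, keywords, t=Thresholds()) -> bool:
    s = bag(sentence)
    # Importance (assumed): closeness to the user's key sentences.
    importance = max((cosine(s, bag(k)) for k in key_sentences), default=0.0)
    # Thematic (assumed): closeness to the user's keywords.
    thematic = cosine(s, bag(" ".join(keywords)))
    # Coherence (assumed): closeness to the previously accepted sentence.
    coherence = cosine(s, bag(previous)) if previous else 1.0
    return (
        importance >= t.importance
        and thematic >= t.thematic
        and coherence >= t.coherence
    )


def filter_generated(candidates, key_sentences, keywords, t=Thresholds()) -> list[str]:
    """Keep only candidates that pass all three tests, in order."""
    kept: list[str] = []
    for sentence in candidates:
        previous = kept[-1] if kept else ""
        if passes_three_tests(sentence, previous, key_sentences, keywords, t):
            kept.append(sentence)
    return kept
```

A sentence must pass all three tests to be kept, so an off-topic sentence is rejected even if it happens to follow fluently from the previous one. In practice the bag-of-words scores would likely be replaced by embedding-based measures.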

Initially, to appropriately acquire the user’s intent for text generation, this system starts by collecting the user’s keywords of interest and a few (approximately 5) corresponding key sentences in the rhetorical structure. The keywords are primarily used to search for papers, and the key

Data source

This study collects papers and generates an introduction for specific academic fields. The data selected for this study were acquired from ScienceDirect, which contains papers published over the past six years (2012–2018). We invited three subjects, A, B and C, to perform the evaluations. To enable the participants to effectively evaluate the quality of the papers, we asked each of the three participants to provide an unpublished research paper and several keywords. A total of 11,404 articles

Discussion

The main objective of this study was to design an article generation method to assist users in writing academic articles. A combination of deep learning, TextRank and word2vec addresses the flaws in the abstractive method. This combination can select more representative sentences and generate a summary that meets the objectives of this study. This result means that the combination of abstractive and extractive methods does help when generating abstracts. The experimental results also show

Conclusions and future work

Scientific research is crucial for promoting technological development. Especially for junior researchers, it is not easy to write academic papers in a limited amount of time and fit them into the existing research structure.

In this paper, we have proposed a new method to quickly generate a new research paper based on the researcher’s needs and to help researchers write academic papers that are well organized, well-structured and complete in accordance with existing literary norms; in this way,

CRediT authorship contribution statement

Hei-Chia Wang: Project administration, Supervision, Methodology. Wei-Ching Hsiao: Data curation, Writing - review & editing, Methodology. Sheng-Han Chang: Methodology, Investigation, Writing - original draft, Software, Conceptualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research is based on work supported by the Taiwan Ministry of Science and Technology under Grant No. MOST 107-2410-H-006-040-MY3 and MOST 108-2511-H-006-009. We would like to thank the Center of Innovative Fintech Business Models, Taiwan for a research grant to support this research.

References (52)

  • Kwan B.S.C., A cross-paradigm macro-structure analysis of research articles in information systems, Engl. Specif. Purp. (2017)
  • Martín P. et al., Convincing peers of the value of one’s research: A genre analysis of rhetorical promotion in academic texts, Engl. Specif. Purp. (2014)
  • Qiang J.-P. et al., Multi-document summarization using closed patterns, Knowl.-Based Syst. (2016)
  • Hu Y.-H. et al., Opinion mining from online hotel reviews–a text summarization approach, Inf. Process. Manage. (2017)
  • Batcha N.K. et al., CRF Based feature extraction applied for supervised automatic text summarization, Proc. Technol. (2013)
  • Shouzhong T. et al., Mining microblog user interests based on textrank with TF-IDF factor, J. China Univ. Posts Telecommun. (2016)
  • Jaffe A.B., Real effects of academic research, Amer. Econ. Rev. (1989)
  • Acs Z.J. et al., Real effects of academic research: comment, Amer. Econ. Rev. (1992)
  • Elander J. et al., Complex skills and academic writing: a review of evidence about the types of learning required to meet core assessment criteria, Assess Eval. High. Educ. (2006)
  • Shaw P. et al., What develops in the development of second-language writing?, Appl. Linguist. (1998)
  • Leacock C. et al., Automated grammatical error detection for language learners, Synth. Lect. Hum. Lang. Technol. (2010)
  • Yen T.-H. et al., WriteAhead: Mining grammar patterns in corpora for assisted writing (2015)
  • J. Chang, J. Chang, WriteAhead2: Mining lexical grammar patterns for assisted writing, in: Proceedings of the 2015...
  • Tarp S. et al., L2 writing assistants and context-aware dictionaries: New challenges to lexicography, Lexikos (2017)
  • Reith R.L. et al., Support tools to assist scientific writing: assessment of key features to construct a system for production engineering, Int. J. Bus. Innov. Res. (2017)
  • Dušek O. et al., A context-aware natural language generator for dialogue systems (2016)