Detecting automatically generated sentences with grammatical structure similarity

Scientometrics

Abstract

Automatically generated papers have been used on numerous occasions to manipulate bibliographic indexes. This paper examines several ways of generating text, such as recurrent neural networks, Markov models, and probabilistic context-free grammars (PCFGs), and asks whether their output can be detected with current approaches. It then focuses on PCFGs, the most widely used of these methods. Although several approaches exist for detecting such papers, they all work at the document level and cannot detect a small amount of generated text embedded in a larger body of genuinely written text. We therefore present a grammatical structure similarity measure that detects sentences or short fragments of automatically generated text from known PCFG generators. The proposed approach is compared against a pattern checker and several common machine learning methods. We also test its ability to detect text produced by a modified PCFG generator.
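To make the idea concrete, the following is a minimal Python sketch, not the measure defined in the paper. It assumes a toy, hypothetical grammar (`GRAMMAR`), represents a sentence's grammatical structure as its sequence of preterminal tags, and uses Jaccard overlap of tag bigrams as a stand-in similarity. The names `generate`, `structure_similarity`, and `max_similarity` are illustrative only, and a real pipeline would have to obtain tags for unknown sentences from a parser or tagger.

```python
import random

# Toy PCFG (hypothetical, for illustration only):
# nonterminal -> list of (probability, expansion).
# Symbols that are not keys of the dict are terminals (plain words).
GRAMMAR = {
    "S": [(0.6, ["NP", "VP"]), (0.4, ["In", "NP", "Punct", "NP", "VP"])],
    "NP": [(0.5, ["Det", "N"]), (0.5, ["Det", "Adj", "N"])],
    "VP": [(0.7, ["V", "NP"]), (0.3, ["V"])],
    "Det": [(1.0, ["the"])],
    "Adj": [(0.5, ["robust"]), (0.5, ["scalable"])],
    "N": [(0.5, ["framework"]), (0.5, ["algorithm"])],
    "V": [(0.5, ["improves"]), (0.5, ["evaluates"])],
    "In": [(1.0, ["in"])],
    "Punct": [(1.0, [","])],
}


def generate(symbol, rng, parent=None):
    """Expand `symbol` recursively and return a list of (word, tag) pairs.

    Each word is tagged with the nonterminal that produced it.
    """
    if symbol not in GRAMMAR:  # terminal: emit the word with its parent tag
        return [(symbol, parent)]
    weights = [p for p, _ in GRAMMAR[symbol]]
    expansions = [rhs for _, rhs in GRAMMAR[symbol]]
    rhs = rng.choices(expansions, weights=weights, k=1)[0]
    pairs = []
    for child in rhs:
        pairs.extend(generate(child, rng, parent=symbol))
    return pairs


def tag_bigrams(tags):
    """Return the set of adjacent tag pairs, with sentence boundary markers."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return set(zip(padded, padded[1:]))


def structure_similarity(tags_a, tags_b):
    """Jaccard overlap of tag bigrams (an assumed stand-in measure)."""
    a, b = tag_bigrams(tags_a), tag_bigrams(tags_b)
    return len(a & b) / len(a | b)


def max_similarity(tags, reference_tag_seqs):
    """Highest similarity between `tags` and any reference tag sequence."""
    return max(
        (structure_similarity(tags, ref) for ref in reference_tag_seqs),
        default=0.0,
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Reference structures sampled from the known generator.
    reference = [[tag for _, tag in generate("S", rng)] for _ in range(50)]
    # Tag sequences for two sentences to be checked (tags assumed given).
    candidate = ["Det", "Adj", "N", "V", "Det", "N"]
    other = ["Pron", "Adv", "V", "Prep", "Det", "N", "Conj", "V"]
    print("candidate:", max_similarity(candidate, reference))
    print("other:    ", max_similarity(other, reference))
```

Because the comparison ignores the words themselves, two sentences produced by the same grammar rules score 1.0 even when their vocabulary differs, which is the property that lets a sentence-level check work on short fragments. Any decision threshold on the score would have to be chosen empirically.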

Notes

  1. https://smritiweb.com/navin/education-2/how-i-published-a-fake-paper-and-why-it-is-the-fault-of-our-education-system.

  2. http://pdos.csail.mit.edu/scigen/.

  3. https://bitbucket.org/birkenfeld/scigen-physics.

  4. http://thatsmathematics.com/mathgen/.

  5. http://www.nadovich.com/chris/randprop/.

  6. http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

  7. http://pan.webis.de/clef14/pan14-web/.

  8. Intel Core i5 2.4 GHz with 16 GB RAM.

  9. http://lexicometrie.imag.fr/scigendetection/dataset.zip.

Author information

Corresponding author

Correspondence to Nguyen Minh Tien.

Appendix

See Figs. 7 and 8.

Fig. 7. An excerpt from a partially generated paper by Navin Kabra, in which a genuinely written paragraph is marked in blue

Fig. 8. An excerpt from a partially generated paper that was submitted to ICAART 2014, containing an unknown number of generated sentences

About this article

Cite this article

Tien, N.M., Labbé, C. Detecting automatically generated sentences with grammatical structure similarity. Scientometrics 116, 1247–1271 (2018). https://doi.org/10.1007/s11192-018-2789-4

  • DOI: https://doi.org/10.1007/s11192-018-2789-4
