Structural Features for Predicting the Linguistic Quality of Text

Nenkova, Ani; Chae, Jieun; Louis, Annie; Pitler, Emily

doi:10.1007/978-3-642-15573-4_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5790)

Abstract

Sentence structure is considered an important component of the overall linguistic quality of text. Yet few empirical studies have sought to characterize how, and to what extent, structural features determine fluency and linguistic quality. We report the results of experiments on the predictive power of syntactic phrasing statistics and other structural features for these aspects of text. Manual assessments of sentence fluency for machine translation evaluation and of text quality for summarization evaluation are used as the gold standard. We find that many structural features related to phrase length are weakly but significantly correlated with fluency, and that classifiers based on the entire suite of structural features can achieve high accuracy both in pairwise comparison of sentence fluency and in distinguishing machine translations from human translations. We also test the hypothesis that the learned models capture general fluency properties applicable to human-authored text. The results of our experiments do not support this hypothesis. At the same time, structural features and models based on them prove to be robust for automatic evaluation of the linguistic quality of multi-document summaries.
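
As an illustration of what "structural features related to phrase length" could look like in practice, the following is a minimal sketch in Python. It is not the authors' feature set or code: the abstract does not define the features, so the choice of labels (NP, VP, PP), the averaging, and the function names `phrase_lengths`, `structural_features` and `pairwise_difference` are all assumptions made for this example. The input is assumed to be a Penn-Treebank-style bracketed parse produced by some external parser.

```python
from collections import defaultdict


def phrase_lengths(parse):
    """Map each constituent label to the word lengths of its spans.

    `parse` is assumed to be a Penn-Treebank-style bracketed string,
    e.g. "(S (NP (DT the) (NN cat)) (VP (VBD sat)))".
    """
    tokens = parse.replace("(", " ( ").replace(")", " ) ").split()
    lengths = defaultdict(list)
    stack = []  # open constituents as [label, words covered so far]
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the constituent label.
            stack.append([tokens[i + 1], 0])
            i += 2
        elif tok == ")":
            label, n_words = stack.pop()
            if stack:
                stack[-1][1] += n_words  # pass the count up to the parent
            lengths[label].append(n_words)
            i += 1
        else:
            stack[-1][1] += 1  # a word under its part-of-speech tag
            i += 1
    return lengths


def structural_features(parse, labels=("NP", "VP", "PP")):
    """Hypothetical feature vector: sentence length plus average phrase lengths."""
    lengths = phrase_lengths(parse)
    # The root constituent spans the whole sentence, so it is the longest span.
    feats = {"sent_len": max(n for spans in lengths.values() for n in spans)}
    for label in labels:
        spans = lengths.get(label, [])
        feats["avg_%s_len" % label] = sum(spans) / len(spans) if spans else 0.0
    return feats


def pairwise_difference(parse_a, parse_b):
    """Feature differences for a sentence pair, a possible input to a pairwise classifier."""
    fa, fb = structural_features(parse_a), structural_features(parse_b)
    return {name: fa[name] - fb[name] for name in fa}


if __name__ == "__main__":
    a = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
    b = "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"
    print(structural_features(a))
    print(pairwise_difference(a, b))
```

A difference vector of this kind is one common way to set up pairwise comparison: a classifier is trained to predict which sentence of a pair was judged more fluent. Whether the chapter uses this exact formulation is not stated in the abstract.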

Author information

Authors and Affiliations

University of Pennsylvania, USA
Ani Nenkova, Jieun Chae, Annie Louis & Emily Pitler

Editor information

Editors and Affiliations

Faculty of Humanities, Department of Communication and Information Sciences (DCI), Tilburg University, P.O.Box 90153, 5000 LE, Tilburg, The Netherlands
Emiel Krahmer
Human Media Interaction (HMI), Department of Electrical Engineering, Mathematics and Computer Science (EEMCS), University of Twente, P.O. Box 217, 7500 AE, Enschede, The Netherlands
Mariët Theune

About this chapter

Cite this chapter

Nenkova, A., Chae, J., Louis, A., Pitler, E. (2010). Structural Features for Predicting the Linguistic Quality of Text. In: Krahmer, E., Theune, M. (eds) Empirical Methods in Natural Language Generation. EACL ENLG 2009. Lecture Notes in Computer Science, vol 5790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15573-4_12

DOI: https://doi.org/10.1007/978-3-642-15573-4_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15572-7
Online ISBN: 978-3-642-15573-4
eBook Packages: Computer Science, Computer Science (R0)