Crawling by Readability Level

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Abstract

The availability of annotated corpora for research in Readability Assessment is still very limited. Meanwhile, researchers increasingly use the Web as a source of written content for building very large and rich corpora, as part of the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It uses a supervised learning method to add a readability filter, based on features with low computational cost, to a crawler, so that the crawler collects texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating texts at level 1 (initial grades) from those at other levels. As a result of this work, two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
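
The idea of a crawler-side readability filter can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the features or the classifier, so the two surface features (mean sentence length and mean word length) and the nearest-centroid classifier are assumptions chosen only because they are cheap to compute. All function names and the toy training sentences are hypothetical.

```python
import re
from statistics import mean

def shallow_features(text):
    """Return (mean sentence length in words, mean word length in characters).

    These two features are illustrative assumptions, not the paper's feature set.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    return (len(words) / len(sentences), mean(len(w) for w in words))

def train_centroids(labelled_texts):
    """Supervised step: average the feature vectors of each reading level."""
    by_level = {}
    for text, level in labelled_texts:
        by_level.setdefault(level, []).append(shallow_features(text))
    return {
        level: tuple(mean(column) for column in zip(*vectors))
        for level, vectors in by_level.items()
    }

def predict_level(text, centroids):
    """Assign the level whose centroid is closest (squared Euclidean distance)."""
    feats = shallow_features(text)
    return min(
        centroids,
        key=lambda lvl: sum((a - b) ** 2 for a, b in zip(feats, centroids[lvl])),
    )

def readability_filter(pages, centroids, target_level):
    """Yield only the crawled pages predicted to be at the target level."""
    for page in pages:
        if predict_level(page, centroids) == target_level:
            yield page

# Toy labelled data (hypothetical): level 1 = simple, level 3 = complex.
training = [
    ("O gato dorme. O sol brilha. A bola caiu.", 1),
    ("A menina come pão. O cão late.", 1),
    ("A disponibilidade de corpora anotados permanece consideravelmente "
     "limitada para pesquisadores interessados em legibilidade.", 3),
    ("Os resultados experimentais indicam diferenças estatisticamente "
     "significativas entre categorias textuais heterogêneas analisadas.", 3),
]
centroids = train_centroids(training)

easy_page = "O pato nada. A flor abriu."
hard_page = ("A avaliação automática da legibilidade utiliza características "
             "linguísticas superficiais extraídas de documentos coletados na internet.")
kept = list(readability_filter([easy_page, hard_page], centroids, target_level=1))
```

In this sketch `kept` holds only the short-sentence page. A realistic filter would scale the features, since they are combined here without normalization, and would be trained on a graded reference corpus instead of four sentences.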

Notes

  1. http://www.linguateca.pt/ACDC/.

  2. http://dinis2.linguateca.pt/acesso/tokens/formas.totalbr.txt.

  3. http://www.bing.com/toolbox/bingsearchapi.

  4. The toolkit is divided into a web crawling module, several combinable filter modules, a deduplication module, and a post-processing module responsible for annotating and compiling the corpus.

  5. https://pt.wikibooks.org/.

  6. All correlations presented a significance level higher than 99%.

Acknowledgments

This research was partially developed in the context of the project Text Simplification of Complex Expressions, sponsored by Samsung Eletrônica da Amazônia Ltda., under the terms of Brazilian law n. 8.248/91. This work was also partly supported by CNPq (482520/2012-4, 312114/2015-0) and FAPERGS AiMWEst.

Author information

Corresponding author

Correspondence to Rodrigo Wilkens.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Filho, J.A.W., Wilkens, R., Zilio, L., Idiart, M., Villavicencio, A. (2016). Crawling by Readability Level. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_31

  • DOI: https://doi.org/10.1007/978-3-319-41552-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)
