Accurate and efficient general-purpose boilerplate detection for crawled web corpora

  • Project Notes
  • Published in: Language Resources and Evaluation

Abstract

Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95% and 99% correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The individual features used for the classification are evaluated with regard to the merit they contribute to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
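
The core of the approach can be illustrated as follows. This is a minimal sketch in Python with scikit-learn, not the actual texrex implementation (which relies on the FANN library, see notes 10 and 11); the features and training examples shown are hypothetical simplifications of the feature set evaluated in the paper.

    # Minimal sketch: train an MLP to classify text blocks as boilerplate (1)
    # versus content (0) from a few shallow block features. Feature names and
    # values are illustrative, not the paper's actual feature set.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def block_features(text):
        """Compute shallow features of one text block."""
        n = max(len(text), 1)
        tokens = text.split()
        return [
            len(text),                             # raw block length
            sum(c in ".,;:!?" for c in text) / n,  # punctuation ratio
            sum(c.isupper() for c in text) / n,    # uppercase ratio
            sum(len(t) for t in tokens) / max(len(tokens), 1),  # mean token length
        ]

    # Toy training data: manually annotated blocks (1 = boilerplate, 0 = content).
    blocks = [
        "Home | About | Contact | Imprint",
        "Copyright 2016. All rights reserved.",
        "The committee met on Tuesday to discuss the new budget proposal.",
        "She argued that the results were consistent with earlier findings.",
    ]
    y = np.array([1, 1, 0, 0])
    X = np.array([block_features(b) for b in blocks])

    clf = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                        max_iter=2000, random_state=0).fit(X, y)

    # At application time, the network output is thresholded at 0.5 (cf. note 12).
    print(clf.predict_proba(X)[:, 1] >= 0.5)  # expected: [ True  True False False]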


Notes

  1. I use the term general-purpose boilerplate detection in opposition to template-based boilerplate detection, where the markup structure of the web page is known beforehand because, for example, only pages generated by a limited number of content management systems are processed.

  2. https://github.com/rsling/texrex.

  3. http://rolandschaefer.net/?p=88.

  4. http://corporafromtheweb.org and https://webcorpora.org.

  5. I admit that it is generally unfair to cite single scores, since all authors perform quite differentiated evaluations with fine-grained results. The values cited here are intended for a rough orientation only, and readers are advised to refer to the papers for full evaluations.

  6. Due to the general nature and the space constraints of Schäfer and Bildhauer (2012), not all of these figures are reported in the original paper. They are, however, reported in our book on web corpus construction (Schäfer and Bildhauer 2013, p. 56).

  7. In a two-dimensional space, two sets of points are linearly separable if they can be separated by a straight line. In higher-dimensional spaces, the straight line is generalized to a hyperplane. The n input features used to train the classifier define an n-dimensional space. A classifier that requires linear separability would impose the requirement that the boilerplate blocks and the non-boilerplate blocks be separable by a hyperplane in this space.
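     Formally, two sets of feature vectors \(A, B \subset \mathbb{R}^n\) are linearly separable if and only if there exist a weight vector \(w \in \mathbb{R}^n\) and a bias \(b \in \mathbb{R}\) such that \(w \cdot x + b > 0\) for all \(x \in A\) and \(w \cdot x + b < 0\) for all \(x \in B\). The XOR configuration discussed by Minsky and Papert (1988) is the classic example of two sets that fail this condition.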

  8. http://cleaneval.sigwac.org.uk/.

  9. The full list is: <article>, <blockquote>, <div>, <h1>–<h6>, <l>, <p>, <section>, <td>, and their closing counterparts (a segmentation sketch follows at the end of these notes).

  10. http://leenissen.dk/fann/wp/.

  11. http://leenissen.dk/fann/html/files/fann-h.html.

  12. Notice that, because the target decisions in the training data were all 0s and 1s, an optimal classifier is expected to perform very well at a threshold of 0.5.

  13. As a reviewer pointed out, it appears that the compatibility between the three Germanic languages is much higher than the compatibility between French and any of the Germanic languages or the pooled Germanic dataset. Therefore, classifiers for language families also seem like a viable option to be explored in the future.

  14. As the evaluation criterion for calculating the merit, the default was chosen (accuracy for discrete classes and root mean square error, RMSE, for numeric classes). The leaveOneAttributeOut option was not used, i.e., each feature was evaluated in isolation (a sketch of this procedure follows at the end of these notes). Using leaveOneAttributeOut was infeasible because even a single run without cross-validation, using 16 parallel threads, took 15 h on a powerful machine; the full evaluation would have taken approximately 25 days.

  15. Because training is a one-time process, training times are irrelevant in applications.

  16. See also the evaluation of document fragmentation in the dissertation where the jusText algorithm was introduced (Pomikálek 2011, 52–54).
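
The following sketch illustrates the block segmentation described in note 9: a page is split into candidate text blocks at the listed block-level tags, and remaining inline markup is stripped. This is a simplified regex-based illustration in Python, not the actual parser implemented in texrex.

    import re

    # Block-level tags from note 9; pages are split at their opening and
    # closing variants. Tags not in the list (e.g. <li>) do not split blocks.
    BLOCK_TAGS = r"article|blockquote|div|h[1-6]|l|p|section|td"
    SPLIT_RE = re.compile(rf"</?(?:{BLOCK_TAGS})\b[^>]*>", re.IGNORECASE)
    INLINE_TAG_RE = re.compile(r"<[^>]+>")  # strip markup left inside a block

    def segment(html):
        """Split a page into non-empty text blocks at block-level tag boundaries."""
        blocks = []
        for chunk in SPLIT_RE.split(html):
            text = " ".join(INLINE_TAG_RE.sub("", chunk).split())
            if text:
                blocks.append(text)
        return blocks

    print(segment("<div>Home | About</div><p>Some <b>actual</b> content.</p>"))
    # expected: ['Home | About', 'Some actual content.']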
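
Note 14 refers to per-feature merit calculation, apparently carried out with Weka (cf. the leaveOneAttributeOut option and Hall and Witten 2011). Below is a rough scikit-learn analogue of evaluating each feature in isolation, reusing X, y, and MLPClassifier from the sketch after the abstract; the feature names are the hypothetical ones used there.

    from sklearn.model_selection import cross_val_score

    # Hypothetical feature names matching the toy feature vectors above.
    feature_names = ["length", "punct_ratio", "upper_ratio", "mean_token_len"]
    for i, name in enumerate(feature_names):
        # Train and score a classifier on the i-th feature alone.
        scores = cross_val_score(
            MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
            X[:, i:i + 1], y, cv=2)
        print(f"{name}: mean accuracy {scores.mean():.2f}")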

References

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.

  • Baroni, M., Chantree, F., Kilgarriff, A., & Sharoff, S. (2008). CleanEval: A competition for cleaning webpages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, et al. (Eds.), Proceedings of the sixth international conference on language resources and evaluation (LREC ’08) (pp. 638–643). Marrakech: European Language Resources Association (ELRA).

  • Bauer, D., Degen, J., Deng, X., Herger, P., Gasthaus, J., Giesbrecht, E., et al. (2007). Filtering the internet by automatic subtree classification. In C. Fairon, H. Naets, A. Kilgarriff, & G. M. de Schryver (Eds.), Building and exploring web corpora: Proceedings of the 3rd web as corpus workshop (incorporating CLEANEVAL) (pp. 111–122). Louvain: Presses Universitaires de Louvain.

  • Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). The Leipzig Corpora Collection—Monolingual corpora of standard size. In Proceedings of corpus linguistics 2007. Birmingham: University of Birmingham.

  • Broder, A. Z. (2000). Identifying and filtering near-duplicate documents. In R. Giancarlo & D. Sankoff (Eds.), Proceedings of combinatorial pattern matching (pp. 1–10). Berlin: Springer.

  • Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 1–27.

  • Cortez, P. (2011). Data mining with multilayer perceptrons and support vector machines. In D. E. Holmes & L. C. Jain (Eds.), Data mining: Foundations and intelligent paradigms. Volume 2: Statistical, Bayesian, time series and other theoretical aspects (Vol. 2, pp. 9–23). Berlin: Springer.

  • Evert, S., & Hardie, A. (2011). Twenty-first century corpus workbench: Updating a query architecture for the new millennium. In Proceedings of corpus linguistics 2011. Birmingham: University of Birmingham.

  • Finn, A., Kushmerick, N., & Smyth, B. (2001). Fact or fiction: Content classification for digital libraries. In DELOS workshop: Personalisation and recommender systems in digital libraries.

  • Gallé, M., & Renders, J. M. (2014). Boilerplate detection and recoding. In M. de Rijke, T. Kenter, A. de Vries, C. X. Zhai, F. de Jong, K. Radinsky, et al. (Eds.), Advances in information retrieval—36th European conference on IR research, ECIR (pp. 462–467). Berlin: Springer.

  • Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.

  • Hall, M., & Witten, I. H. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). Burlington: Morgan Kaufmann.

  • Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. In B. D. Davison, T. Suel, N. Craswell, & B. Liu (Eds.), WSDM ’10: Proceedings of the third ACM international conference on web search and data mining (pp. 441–450). New York: ACM.

  • Marek, M., Pecina, P., Spousta, M. (2007). Web page cleaning with conditional random fields. In C. Fairon, H. Naets, A. Kilgarriff, & G. M. de Schryver (Eds.), Building and exploring web corpora: Proceedings of the 3rd web as corpus workshop (incorporating CLEANEVAL) (pp. 155–162). Louvain: Presses Universitaires de Louvain.

  • Minsky, M. L., & Papert, S. A. (1988). Perceptrons. Cambridge: MIT Press.

  • Neunerdt, M., Reimer, E., Reyer, M., & Mathar, R. (2015). Enhanced web page cleaning for constructing social media text corpora. In K. J. Kim (Ed.), Information science and applications (pp. 665–672). Berlin: Springer.

  • Nissen, S. (2003). Implementation of a Fast Artificial Neural Network Library (FANN). Technical report. Datalogisk Institut, Københavns Universitet, Copenhagen.

  • Pasternack, J., & Roth, D. (2009). Extracting article text from the web with maximum subsequence segmentation. In J. Quemada, G. León, Y. Maarek, & W. Nejdl (Eds.), WWW ’09: Proceedings of the 18th international conference on World Wide Web (pp. 971–980). Madrid: ACM.

  • Pomikálek, J., Rychlý, P., & Kilgarriff, A. (2009). Scaling to billion-plus word corpora. Research in Computing Science, 41, special issue: Advances in Computational Linguistics.

  • Pomikálek, J. (2011). Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk University Faculty of Informatics, Brno. http://is.muni.cz/th/45523fi_d/phdthesis.pdf.

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

  • Schäfer, R. (2015). Processing and querying large web corpora with the COW14 architecture. In P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, & A. Witt (Eds.), Proceedings of challenges in the management of large corpora 3 (CMLC-3). UCREL, Lancaster.

  • Schäfer, R. (2016). CommonCOW: Massively huge web corpora from CommonCrawl data and a method to distribute them freely under restrictive EU copyright laws. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, et al. (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC ’16) (pp. 4500–4504). Portorož: European Language Resources Association (ELRA).

  • Schäfer, R., & Bildhauer, F. (2012). Building large corpora from the web using a new efficient tool chain. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, et al. (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC ’12) (pp. 486–493). Istanbul: European Language Resources Association (ELRA).

  • Schäfer, R., & Bildhauer, F. (2013). Web corpus construction. Synthesis lectures on human language technologies. San Francisco: Morgan and Claypool.

  • Spousta, M., Marek, M., & Pecina, P. (2008). Victor: The web-page cleaning tool. In S. Evert, A. Kilgarriff, & S. Sharoff (Eds.), Proceedings of the 4th web as corpus workshop (pp. 12–17). Marrakech: European Language Resources Association (ELRA).

  • Üstün, B., Melssen, W. J., & Buydens, L. M. (2006). Facilitating the application of support vector regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems, 81, 29–40.

Acknowledgments

First of all, I would like to thank the two anonymous LREV reviewers, whose comments greatly contributed to the quality of the paper. I am very grateful to the former student assistants Sarah Dietzfelbinger and Lea Helmers for their work on the annotation of the training data. I would also like to thank Felix Bildhauer for the ongoing collaborative work on the COW corpora and Stefan Müller for his support of the COW project, both since 2011. This work is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) through grant SCHA1916/1-1 Linguistic web characterization and web corpus creation.

Author information

Correspondence to Roland Schäfer.

About this article

Cite this article

Schäfer, R. Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Lang Resources & Evaluation 51, 873–889 (2017). https://doi.org/10.1007/s10579-016-9359-2
