Open set evaluation of web genre identification

  • Original Paper
Language Resources and Evaluation

Abstract

Web genre detection is a task that can enhance information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. Most previous studies in this field adopt the closed-set scenario, where a given palette comprises all available genre labels. However, this is not a realistic setup, since web genres are constantly enriched with new labels and existing web genres evolve over time. Open-set classification, where some pages used in the evaluation phase do not belong to any of the known genres, is a more realistic setup for this task. In this case, all pages not belonging to known genres can be seen as noise. This paper focuses on the systematic evaluation of open-set web genre identification when the noise is either structured or unstructured. Two open-set methods, combined with alternative text representation schemes and similarity measures, are tested on two benchmark corpora. Moreover, we adopt the openness test for web genre identification, which enables the observation of effectiveness for a varying number of known and unknown labels.
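
The difference between the closed-set and open-set scenarios can be illustrated with a minimal sketch. It is not one of the two methods evaluated in the paper; it assumes a simple bag-of-words representation, cosine similarity, one summed profile per known genre, and a rejection threshold. The genre names, the toy pages, the `UNKNOWN` label and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

UNKNOWN = "unknown"  # hypothetical label for pages outside the known genres


def vectorize(text):
    """Bag-of-words term counts (a stand-in for any text representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_pages):
    """Sum the term counts of each known genre into one profile per genre."""
    centroids = {}
    for text, genre in labelled_pages:
        centroids.setdefault(genre, Counter()).update(vectorize(text))
    return centroids


def predict_open_set(text, centroids, threshold=0.3):
    """Return the most similar known genre, or UNKNOWN if none is close enough.

    A closed-set classifier would always return best_genre; the threshold
    (an assumed value here) is what lets the classifier reject noise pages.
    """
    vec = vectorize(text)
    best_genre, best_sim = max(
        ((genre, cosine(vec, profile)) for genre, profile in centroids.items()),
        key=lambda pair: pair[1],
    )
    return best_genre if best_sim >= threshold else UNKNOWN


# Toy training pages for two hypothetical known genres.
training = [
    ("buy now add to cart price shipping", "eshop"),
    ("cart checkout price discount buy", "eshop"),
    ("posted by comments blog entry today", "blog"),
    ("my blog entry comments posted yesterday", "blog"),
]
centroids = train_centroids(training)

print(predict_open_set("add to cart and buy at this price", centroids))
print(predict_open_set("frequently asked questions and answers", centroids))
```

In this sketch the first page is assigned to a known genre, while the second shares no vocabulary with either profile and is rejected as unknown. In practice the threshold would have to be tuned, and its value determines how the classifier trades off misclassified noise against rejected pages of known genres.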

Notes

  1. http://scikit-learn.org.

Author information

Corresponding author

Correspondence to Dimitrios Pritsos.

About this article

Cite this article

Pritsos, D., Stamatatos, E. Open set evaluation of web genre identification. Lang Resources & Evaluation 52, 949–968 (2018). https://doi.org/10.1007/s10579-018-9418-y

  • DOI: https://doi.org/10.1007/s10579-018-9418-y