Abstract
Web genre detection can enhance information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. Most previous studies in this field adopt the closed-set scenario, where a given palette comprises all available genre labels. However, this is not a realistic setup, since web genres are constantly enriched with new labels and existing web genres evolve over time. Open-set classification, where some pages used in the evaluation phase do not belong to any of the known genres, is a more realistic setup for this task. In this case, all pages not belonging to known genres can be seen as noise. This paper focuses on the systematic evaluation of open-set web genre identification when the noise is either structured or unstructured. Two open-set methods, combined with alternative text representation schemes and similarity measures, are tested on two benchmark corpora. Moreover, we adopt the openness test for web genre identification, which makes it possible to observe effectiveness for a varying number of known and unknown labels.
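To make the open-set setup concrete, the following is a minimal illustrative sketch, not the methods evaluated in the paper. It assumes a bag-of-words representation, one centroid per known genre, cosine similarity, and a hypothetical rejection threshold; a page whose best similarity falls below the threshold is labelled "unknown" rather than forced into a known genre. The function names, the toy training data, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    """Sum the term counts of all training pages of one genre."""
    total = Counter()
    for doc in docs:
        total.update(doc.split())
    return total


def classify_open_set(page, centroids, threshold):
    """Return the most similar known genre, or "unknown" if the best
    similarity is below the (hypothetical) rejection threshold."""
    vec = Counter(page.split())
    best_genre, best_sim = "unknown", 0.0
    for genre, c in centroids.items():
        sim = cosine(vec, c)
        if sim > best_sim:
            best_genre, best_sim = genre, sim
    return best_genre if best_sim >= threshold else "unknown"


# Toy training data (assumed, for illustration only).
training = {
    "blog": ["i wrote a post today", "my post about my day"],
    "eshop": ["buy now add to cart", "price cart checkout buy"],
}
centroids = {genre: centroid(docs) for genre, docs in training.items()}

print(classify_open_set("add to cart and buy", centroids, threshold=0.3))
print(classify_open_set("quantum physics lecture notes", centroids, threshold=0.3))
```

In a closed-set classifier the second page would be assigned to whichever known genre happened to score highest; the rejection step is what lets pages from unseen genres be treated as noise.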
Cite this article
Pritsos, D., Stamatatos, E. Open set evaluation of web genre identification. Lang Resources & Evaluation 52, 949–968 (2018). https://doi.org/10.1007/s10579-018-9418-y
DOI: https://doi.org/10.1007/s10579-018-9418-y