Abstract
Automated Genre Identification (AGI) of web pages is a problem of increasing importance since web genre (e.g. blog, news, e-shops, etc.) information can enhance modern Information Retrieval (IR) systems. The state-of-the-art in this field considers AGI as a closed-set classification problem where a variety of web page representation and machine learning models have intensively studied. In this paper, we study AGI as an open-set classification problem which better formulates the real world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined, one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Rosso, M.: Using genre to improve web search. PhD thesis, University of North Carolina at Chapel Hill (2005)
Braslavski, P.: Combining relevance and genre-related rankings: An exploratory study. In: Proceedings of the International Workshop Towards Genreenabled Search Engines: The Impact of NLP, pp. 1–4 (2007)
Sharoff, S., Wu, Z., Markert, K.: The web library of babel: evaluating genre collections. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation, pp. 3063–3070 (2010)
Santini, M., Sharoff, S.: Web genre benchmark under construction. Journal for Language Technology and Computational Linguistics 24(1), 129–145 (2009)
Kanaris, I., Stamatatos, E.: Learning to recognize webpage genres. Information Processing & Management 45(5), 499–512 (2009)
Dong, L., Watters, C., Duffy, J., Shepherd, M.: Binary cybergenre classification using theoretic feature measures (2006)
Feldman, S., Marin, M., Medero, J., Ostendorf, M.: Classifying factored genres with part-of-speech histograms. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Association for Computational Linguistics, pp. 173–176 (2009)
Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Language Resources and Evaluation 45(1), 83–94 (2011)
Meyer zu Eissen, S., Stein, B.: Genre Classification of Web Pages. In: Biundo, S., Frühwirth, T., Palm, G. (eds.) KI 2004. LNCS (LNAI), vol. 3238, pp. 256–269. Springer, Heidelberg (2004)
Santini, M.: Automatic identification of genre in web pages. PhD thesis, University of Brighton (2007)
Lim, C.S., Lee, K.J., Kim, G.C.: Multiple sets of features for automatic genre classification of web documents. Information Processing and Management 41(5), 1263–1276 (2005)
Mason, J., Shepherd, M., Duffy, J.: An n-gram based approach to automatically identifying web page genre. In: HICSS, pp. 1–10. IEEE Computer Society (2009)
Khan, S.S., Madden, M.G.: A Survey of Recent Trends in One Class Classification. In: Coyle, L., Freyne, J. (eds.) AICS 2009. LNCS, vol. 6206, pp. 188–197. Springer, Heidelberg (2010)
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R.: Estimating the support of a high-dimensional distribution. Technical Report MSR-TR-99-87 (1999)
Manevitz, L., Yousef, M.: One-class svms for document classification. The Journal of Machine Learning Research 2, 139–154 (2002)
Anderka, M., Stein, B., Lipka, N.: Detection of text quality as as a one-class classification problem. In: 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pp. 2313–2316 (2011)
Ferretti, E., Fusilier, D., Cabrera, R., y Gómez, M., Errecalde, M., Rosso, P.: On the use of pu learning for quality flaw prediction in wikipedia. In: Working Notes, CLEF 2012 Evaluation Labs and Workshop, Rome, Italy, 17-20 (2012)
Bishop, C.: Pattern Recognition and Machine Learning, 331–336 (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pritsos, D.A., Stamatatos, E. (2013). Open-Set Classification for Automated Genre Identification. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_18
Download citation
DOI: https://doi.org/10.1007/978-3-642-36973-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5
eBook Packages: Computer ScienceComputer Science (R0)