Abstract
Web page genre identification has recently attracted more attention because of its importance in web search. Most existing work extracts features from web pages and applies machine learning classifiers such as SVM to identify their genre. However, when a web page does not contain enough information, such an approach may not work well. In this paper, we tackle genre identification in such situations. We propose a link-based graph model that takes neighboring pages into account while greatly reducing noisy information by selecting an appropriate subset of those pages. We compared this neighboring-page-based classifier with other classifiers. Experiments were conducted on two known corpora, and the favorable results indicate that the proposed approach is feasible.
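The abstract does not specify how neighboring pages are selected or combined, so the following is only a minimal illustrative sketch of the general idea, not the authors' model. It assumes, hypothetically, that neighbors are filtered by cosine similarity of term frequencies against a fixed threshold and that per-genre scores from some base classifier are blended by a simple weighted average. All function names, the threshold, the weight, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_neighbors(page, links, texts, threshold=0.3):
    """Keep only linked pages whose text is similar enough to the target page.

    The similarity criterion and threshold are assumptions, not taken from the paper.
    """
    target = Counter(texts[page].split())
    return [n for n in links.get(page, [])
            if cosine(target, Counter(texts[n].split())) >= threshold]


def combine_scores(page, neighbors, scores, weight=0.5):
    """Blend the page's own genre scores with the mean of its selected neighbors."""
    own = scores[page]
    if not neighbors:
        return dict(own)
    blended = {}
    for genre in own:
        mean_nb = sum(scores[n].get(genre, 0.0) for n in neighbors) / len(neighbors)
        blended[genre] = (1 - weight) * own[genre] + weight * mean_nb
    return blended


# Toy data (hypothetical): page "p" has little text and links to "a" and "b".
texts = {
    "p": "course syllabus",
    "a": "course syllabus lecture notes exam",
    "b": "buy cheap shoes online",
}
links = {"p": ["a", "b"]}
# Per-genre scores as they might come from any base classifier (made up here).
scores = {
    "p": {"academic": 0.5, "shop": 0.5},
    "a": {"academic": 0.9, "shop": 0.1},
    "b": {"academic": 0.1, "shop": 0.9},
}

selected = select_neighbors("p", links, texts)
blended = combine_scores("p", selected, scores)
print(selected, blended)
```

In this toy case the unrelated neighbor "b" is filtered out, so only "a" influences the blended scores for "p". Without the selection step, the two neighbors would cancel each other out, which is the kind of noise the abstract says the proposed model aims to reduce.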
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, J., Zhou, X., Fung, G. (2011). Enhance Web Pages Genre Identification Using Neighboring Pages. In: Bouguettaya, A., Hauswirth, M., Liu, L. (eds) Web Information System Engineering – WISE 2011. WISE 2011. Lecture Notes in Computer Science, vol 6997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24434-6_23
DOI: https://doi.org/10.1007/978-3-642-24434-6_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24433-9
Online ISBN: 978-3-642-24434-6
eBook Packages: Computer Science (R0)