research-article

Web page classification: Features and algorithms

Published: 23 February 2009

Abstract

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process.

As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

References

  1. Aas, K. and Eikvil, L. 1999. Text categorisation: A survey. Tech. rep. 941. Norwegian Computing Center, Oslo, Norway.Google ScholarGoogle Scholar
  2. Agarwal, S. 2006. Ranking on graph data. In Proceedings of the 23rd International Conference on Machine Learning (ICML). ACM Press, New York, NY, 25--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Amitay, E. 1998. Using common hypertext links to identify the best phrasal description of target Web documents. In Proceedings of the SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web (Melbourne, Australia).Google ScholarGoogle Scholar
  4. Amitay, E., Carmel, D., Darlow, A., Lempel, R., and Soffer, A. 2003. The connectivity sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HYPERTEXT). ACM Press, New York, NY, 38--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Angelova, R. and Siersdorfer, S. 2006. A neighborhood-based approach for clustering of linked document collections. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 778--779. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Angelova, R. and Weikum, G. 2006. Graph-based text classification: Learn from your neighbors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 485--492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Armstrong, R., Freitag, D., Joachims, T., and Mitchell, T. 1995. WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Distributed, Heterogeneous Environments. AAAI Press, Menlo Park, CA, 6--12.Google ScholarGoogle Scholar
  8. Asirvatham, A. P. and Ravi, K. K. 2001. Web page classification based on document structure. Awarded second prize in National Level Student Paper Contest conducted by IEEE India Council.Google ScholarGoogle Scholar
  9. Attardi, G., Gulli, A., and Sebastiani, F. 1999. Automatic Web page categorization by link and context analysis. In Proceedings of First European Symposium on Telematics, Hypermedia and Artificial Intelligence (THAI, Varese, Italy), C. Hutchison and G. Lanzarone, Eds., 105--119.Google ScholarGoogle Scholar
  10. Beeferman, D. and Berger, A. 2000. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 407--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Bennett, P. N., Dumais, S. T., and Horvitz, E. 2005. The combination of text classifiers using reliability indicators. Inform. Retriev. 8, 1, 67--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Berendt, B. and Hanser, C. 2007. Tags are not metadata, but “just more content”—to some people. In Proceedings of the International Conference on Weblogs and Social Media. 26--28.Google ScholarGoogle Scholar
  13. Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT). ACM Press, New York, NY, 92--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Broder, A., Fontoura, M., Josifovski, V., and Riedel, L. 2007a. A semantic approach to contextual advertising. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 559--566. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Broder, A. Z., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., and Zhang, T. 2007b. Robust classification of rare queries using Web knowledge. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 231--238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML). 89--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Calado, P., Cristo, M., Moura, E., Ziviani, N., Ribeiro-Neto, B., and Goncalves, M. A. 2003. Combining link-based and content-based methods for Web document classification. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 394--401. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 186--193. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Cardoso-Cachopo, A. and Oliveira, A. L. 2003. An empirical comparison of text categorization methods. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 2857. Springer, Berlin, Germany, 183--196.Google ScholarGoogle Scholar
  20. Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the Web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 423--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Chakrabarti, S. 2000. Data mining for hypertext: A tutorial survey. SIGKDD Explorat. Newsl. 1, 2 (Jan.), 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Chakrabarti, S. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Chakrabarti, S., Dom, B. E., and Indyk, P. 1998. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, 307--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Chakrabarti, S., Joshi, M. M., Punera, K., and Pennock, D. M. 2002. The structure of broad topics on the Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 251--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific Web resource discovery. In Proceeding of the 8th International Conference on World Wide Web (WWW). Elsevier, New York, NY, 1623--1640. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Chekuri, C., Goldwasser, M., Raghavan, P., and Upfal, E. 1997. Web search using automated classification. In Proceedings of the Sixth International World Wide Web Conference (Santa Clara, CA). Poster POS725.Google ScholarGoogle Scholar
  27. Chen, H. and Dumais, S. 2000. Bringing order to the Web: Automatically categorizing search results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press, New York, NY, 145--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Chen, Z., Wu, O., Zhu, M., and Hu, W. 2006. A novel Web page filtering system by combining texts and images. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 732--735. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Chesley, P., Vincent, B., Xu, L., and Srihari, R. K. 2006. Using verbs and adjectives to automatically classify blog sentiment. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 27--29. Technical Report SS-06-03.Google ScholarGoogle Scholar
  30. Chirita, P. A., Costache, S., Nejdl, W., and Handschuh, S. 2007. P-tag: Large scale automatic generation of personalized annotation tags for the Web. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 845--854. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Choi, B. and Yao, Z. 2005. Web page classification. In Foundations and Advances in Data Mining, W. Chu and T. Y. Lin, Eds. Studies in Fuzziness and Soft Computing, vol. 180. Springer-Verlag, Berlin, Germany, 221--274.Google ScholarGoogle Scholar
  32. Cohen, W. W. 2002. Improving a page classifier with anchor extraction and link analysis. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer, Eds. Vol. 15. MIT Press, Cambridge, MA, 1481--1488.Google ScholarGoogle Scholar
  33. Cohn, D. and Hofmann, T. 2001. The missing link—a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), vol. 13. MIT Press, Cambridge, MA.Google ScholarGoogle Scholar
  34. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 1998. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 509--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Davidov, D., Gabrilovich, E., and Markovitch, S. 2004. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 250--257. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Davison, B. D. 2000. Topical locality in the Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 272--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology. Vol. 41. American Society for Information Science & Technology, Providence, RI, 393--402.Google ScholarGoogle ScholarCross RefCross Ref
  38. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391--407.Google ScholarGoogle ScholarCross RefCross Ref
  39. Dietterich, T. G. and Bakiri, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artic. Intell. Res. 2, 263--286.Google ScholarGoogle ScholarCross RefCross Ref
  40. Doan, A., Madhavan, J., Domingos, P., and Halevy, A. 2002. Learning to map between ontologies on the semantic Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM, New York, NY, 662--673. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Drost, I., Bickel, S., and Scheffer, T. 2005. Discovering communities in linked data by multi-view clustering. In From Data and Information Analysis to Knowledge Engineering: Proceedings of 29th Annual Conference of the German Classification Society. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Germany, 342--349.Google ScholarGoogle Scholar
  42. Duda, R. O. and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York, NY.Google ScholarGoogle Scholar
  43. Dumais, S. and Chen, H. 2000. Hierarchical classification of Web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 256--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Elgersma, E. and de Rijke, M. 2006. Learning to recognize blogs: A preliminary exploration. In EACL Workshop: New Text—Wikis and blogs and other dynamic text sources.Google ScholarGoogle Scholar
  45. Ester, M., Kriegel, H.-P., and Schubert, M. 2002. Web site mining: A new way to spot competitors, customers and suppliers in the World Wide Web. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 249--258. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Fisher, M. J. and Everson, R. M. 2003. When are links useful? Experiments in text classification. In Advances in Information Retrieval. Proceedings of the 25th European Conference on IR Research. 41--56.Google ScholarGoogle Scholar
  47. Fitzpatrick, L. and Dent, M. 1997. Automatic feedback using past queries: Social searching? In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Fürnkranz, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of the 3rd Symposium on Intelligent Data Analysis (IDA-99), D. J. Hand, J. N. Kok, and M. R. Berthold, Eds. Lecture Notes in Computer Science, vol. 1642. Springer-Verlag, Amsterdam, The Netherlands, 487--497. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Fürnkranz, J. 2001. Hyperlink ensembles: A case study in hypertext classification. J. Inform. Fus. 1, 299--312.Google ScholarGoogle Scholar
  50. Fürnkranz, J. 2005. Web mining. In The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer, Berlin, Germany, 899--920.Google ScholarGoogle Scholar
  51. Gabrilovich, E. and Markovitch, S. 2004. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the 21st International Conference on Machine learning. ACM Press, New York, NY, 41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Gabrilovich, E. and Markovitch, S. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference for Artificial Intelligence (IJCAI). 1048--1053.Google ScholarGoogle Scholar
  53. Gabrilovich, E. and Markovitch, S. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 1301--1306.Google ScholarGoogle Scholar
  54. Gabrilovich, E. and Markovitch, S. 2007. Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. J. Mach. Learn. Res. 8, 2297--2345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Getoor, L. and Diehl, C. 2005. Link mining: A survey. SIGKDD Explorat. Newsl. (Special Issue on Link Mining) 7, 2 (Dec.), 3--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Ghani, R. 2001. Combining labeled and unlabeled data for text classification with a large number of categories. In First IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 597. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Ghani, R. 2002. Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of the 19th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 187--194. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Ghani, R., Slattery, S., and Yang, Y. 2001. Hypertext categorization using hyperlink patterns and meta data. In Proceedings of the 18th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 178--185. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Glance, N. S. 2000. Community search assistant. In Artificial Intelligence for Web Search. AAAI Press Mento Park, CA, 29--34. Presented at the AAAI-2000 Workshop on Artificial Intelligence for Web Search, Technical Rep. WS-00-01.Google ScholarGoogle Scholar
  60. Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., and Flake, G. W. 2002. Using Web structure for classifying and describing Web pages. In Proceedings of the 11th International Conference on World Wide Web. ACM Press, New York, NY, 562--569. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Golub, K. and Ardo, A. 2005. Importance of HTML structural elements and metadata in automated subject classification. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL). Lecture Notes in Computer Science, vol. 3652. Springer, Berlin, Germany, 368--378.Google ScholarGoogle Scholar
  62. Gövert, N., Lalmas, M., and Fuhr, N. 1999. A probabilistic description-oriented approach for categorizing Web documents. In Proceedings of the 8th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 475--482. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Gyöngyi, Z. and Garcia-Molina, H. 2005a. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB, Trondheim, Norway). 517--528. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Gyöngyi, Z. and Garcia-Molina, H. 2005b. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval (AIRWeb), B. D. Davison, Ed. Lehigh University, Department of Computer Science, Bethlehem, PA, 39--47. Technical rep. LU-CSE-05-030.Google ScholarGoogle Scholar
  65. Hammami, M., Chahir, Y., and Chen, L. 2003. Webguard: Web based adult content detection and filtering system. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 574. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Harabagiu, S. M., Pasca, M. A., and Maiorano, S. J. 2000. Experiments with open-domain textual question answering. In Proceedings of the 18th Conference on Computational Linguistics. Association for Computational Linguistics. Morristown, NJ, 292--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Haveliwala, T. H. 2003. Topic-sensitive PageRank: A context-sensitive ranking algorithm for Web search. IEEE Trans. Knowl. Data Eng. 15, 4, 784--796. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. He, X., Zha, H., Ding, C. H. Q., and Simon, H. D. 2002. Web document clustering using hyperlink structures. Computat. Stat. Data Anal. 41, 1, 19--45.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Hermjakob, U. 2001. Parsing and question classification for question answering. In Proceedings of the ACL Workshop on Open-Domain Question Answering. 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Hofmann, T. 1999a. Probabilistic latent semantic analysis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI, Stockholm, Sweden). 289--296. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Hofmann, T. 1999b. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004a. Liveclassifier: Creating hierarchical text classifiers through Web corpora. In Proceedings of the 13th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 184--192. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004b. Using a Web-based categorization approach to generate thematic metadata from texts. ACM Trans. Asian Lang. Inform. Process. 3, 3, 190--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Jäschke, R., Marinho, L. B., Hotho, A., Schmidt-Thieme, L., and Stumme, G. 2007. Tag recommendations in folksonomies. In Proceedings of Knowledge Discovery in Databases: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), J. N. Kok, J. Koronacki, R. L. de Mntaras, S. Matwin, D. Mladenic, and A. Skowron, Eds. Lecture Notes in Computer Science, vol. 4702. Springer, Berlin, Germany, 506--514.Google ScholarGoogle Scholar
  75. Jensen, D., Neville, J., and Gallagher, B. 2004. Why collective inference improves relational classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 593--598. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 133--142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Joachims, T., Cristianini, N., and Shawe-Taylor, J. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML), C. Brodley and A. Danyluk, Eds. Morgan Kaufmann, San Francisco, CA, 250--257. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Joachims, T., Freitag, D., and Mitchell, T. 1997. WebWatcher: A tour guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 770--775.Google ScholarGoogle Scholar
  79. Käki, M. 2005. Findex: Search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM Press, New York, NY, 131--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Kan, M.-Y. 2004. Web page classification without the Web page. In Proceedings of the 13th International World Wide Web Conference Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 262--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Kan, M.-Y. and Thi, H. O. N. 2005. Fast Webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 325--326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Kiritchenko, S. 2005. Hierarchical text categorization and its application to bioinformatics. Ph.D. dissertation. University of Ottawa, Ottawa, Ont., Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Klose, A. 2004. Extracting fuzzy classification rules from partially labeled data. Soft Comput. 8, 6, 417--427. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Kohlschutter, C., Chirita, P.-A., and Nejdl, W. 2007. Utility analysis for topically biased PageRank. In Proceedings of the 16th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 1211--1212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. Kosala, R. and Blockeel, H. 2000. Web mining research: A survey. SIGKDD Explorat. Newsl. 2, 1 (June), 1--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Kovacevic, M., Diligenti, M., Gori, M., and Milutinovic, V. 2004. Visual adjacency multigraphs—a novel approach for a Web page classification. In Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM). 38--49.Google ScholarGoogle Scholar
  87. Kuncheva, L. I. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Kurland, O. and Lee, L. 2005. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Kurland, O. and Lee, L. 2006. Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 83--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Kwok, C. C. T., Etzioni, O., and Weld, D. S. 2001. Scaling question answering to the Web. In Proceedings of the 10th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 150--161. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Kwon, O.-W. and Lee, J.-H. 2000. Web page classification based on k-nearest neighbor approach. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL). ACM Press, New York, NY, 9--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. Kwon, O.-W. and Lee, J.-H. 2003. Text categorization based on k-nearest neighbor approach for Web site classification. Inform. Process. Manage. 29, 1 (Jan.), 25--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Leshed, G. and Kaye, J. J. 2006. Understanding how bloggers feel: Recognizing affect in blog posts. In CHI '06 Extended Abstracts on Human Factors in Computing Systems. ACM Press, New York, NY, 1019--1024. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Lindemann, C. and Littig, L. 2006. Coarse-grained classification of Web sites by their structural properties. In Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 35--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Liu, T.-Y., Yang, Y., Wan, H., Zeng, H.-J., Chen, Z., and Ma, W.-Y. 2005a. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorat. Newsl. 7, 1, 36--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Liu, W., Xue, G.-R., Yu, Y., and Zeng, H.-J. 2005b. Importance-based Web page classification using cost-sensitive SVM. Adv. Web-Age Inform. Manage. 3739, 127--137.Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Loia, V. and Senatore, S. 2006a. Personalized knowledge models using RDF-based fuzzy classification. Stud. Fuzz. Soft Comput. 197, 45--64.Google ScholarGoogle ScholarCross RefCross Ref
  98. Loia, V. and Senatore, S. 2006b. Proximity-based supervision for flexible Web pages categorization. In Fuzzy Logic and the Semantic Web, Elsevier, The Netherlands, 46--69.Google ScholarGoogle Scholar
  99. Lu, Q. and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML). AAAI Press, Menlo Park, CA.Google ScholarGoogle Scholar
  100. Luxenburger, J. and Weikum, G. 2004. Query-log based authority analysis for Web information search. In Proceedings of the 5th International Conference on Web Information Systems Engineering (WISE). Lecture Notes in Computer Science, vol. 3306. Springer, Berlin, Germany, 90--101.Google ScholarGoogle Scholar
  101. Macskassy, S. A. and Provost, F. 2007. Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res. 8, 935--983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Maguitman, A. G., Menczer, F., Roinestad, H., and Vespignani, A. 2005. Algorithmic detection of semantic similarity. In Proceedings of the 14th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 107--116. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Menczer, F. 2005. Mapping the semantics of Web text and links. IEEE Internet Comput. 9, 3 (May/June), 27--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Mihalcea, R. and Liu, H. 2006. A corpus-based approach to finding happiness. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 139--144. Tech. rep. SS-06-03.Google ScholarGoogle Scholar
  105. Mishne, G. 2005. Experiments with mood classification in blog posts. In Proceedings of the Workshop on Stylistic Analysis of Text for Information Access.Google ScholarGoogle Scholar
  106. Mishne, G. and de Rijke, M. 2006. Capturing global mood levels using blog posts. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 145--152. Tech. rep. SS-06-03.Google ScholarGoogle Scholar
  107. Mitchell, T. M. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Mladenic, D. 1998. Turning Yahoo into an automatic Web-page classifier. In Proceedings of the European Conference on Artificial Intelligence (ECAI). 473--474.Google ScholarGoogle Scholar
  109. Mladenic, D. 1999. Text-learning and related intelligent agents: A survey. IEEE Intell. Syst. Appl. 14, 4 (July/Aug.), 44--54. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Nagarajan, M., Sheth, A., Aguilera, M., Keeton, K., Merchant, A., and Uysal, M. 2007. Altering document term vectors for classification: Ontologies as expectations of co-occurrence. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1225--1226. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Nanno, T., Fujiki, T., Suzuki, Y., and Okumura, M. 2004. Automatically collecting, monitoring, and mining Japanese Weblogs. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 320--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Netscape Communications Corporation. 2008. The dmoz Open Directory Project (ODP). http://www.dmoz.org/.Google ScholarGoogle Scholar
  113. Nie, L., Davison, B. D., and Qi, X. 2006. Topical link analysis for Web search. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 91--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. NIST. 2007. Text REtrieval Conference (TREC). http://trec.nist.gov/.Google ScholarGoogle Scholar
  115. Nowson, S. 2006. The language of Weblogs: A study of genre and individual differences. Ph.D. dissertation, University of Edinburgh, College of Science and Engineering, Edinburgh, Scotland.Google ScholarGoogle Scholar
  116. Oh, H.-J., Myaeng, S. H., and Lee, M.-H. 2000. A practical hypertext catergorization method using links and incrementally available class information. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 264--271.
  117. Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: Bringing order to the Web. Unpublished draft. Stanford University, Stanford, CA.
  118. Park, S.-B. and Zhang, B.-T. 2003. Large scale unstructured document classification using unlabeled data and syntactic information. In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Conference (PAKDD). Lecture Notes in Computer Science, vol. 2637. Springer, Berlin, Germany, 88--99.
  119. Patel, C., Supekar, K., Lee, Y., and Park, E. K. 2003. OntoKhoj: A semantic Web portal for ontology searching, ranking and classification. In Proceedings of the 5th ACM International Workshop on Web Information and Data Management (WIDM). ACM, New York, NY, 58--61.
  120. Pazzani, M., Muramatsu, J., and Billsus, D. 1996. Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 54--61.
  121. Peng, X. and Choi, B. 2002. Automatic Web page classification in a dynamic and hierarchical way. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 386--393.
  122. Pierre, J. M. 2001. On the automated classification of Web sites. Linköping Electron. Art. Comput. Inform. Sci. 6. http://www.ep.liu.se/ea/cis/2001/001/.
  123. Qi, X. and Davison, B. D. 2006. Knowing a Web page by the company it keeps. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 228--237.
  124. Qu, H., Pietra, A. L., and Poon, S. 2006. Automated blog classification: Challenges and pitfalls. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 184--186. Tech. rep. SS-06-03.
  125. Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 239--248.
  126. Riboni, D. 2002. Feature selection for Web page classification. In Proceedings of the Workshop on Web Content Mapping: A Challenge to ICT (EURASIA-ICT).
  127. Richardson, M., Prakash, A., and Brill, E. 2006. Beyond PageRank: Machine learning for static ranking. In Proceedings of the 15th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 707--715.
  128. Rosenfeld, A., Hummel, R., and Zucker, S. 1976. Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybernet. 6, 420--433.
  129. Roussinov, D. and Fan, W. 2005. Discretization based learning approach to information retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT). Association for Computational Linguistics, Morristown, NJ, 153--160.
  130. Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York, NY.
  131. Sebastiani, F. 1999. A tutorial on automated text categorisation. In Proceedings of the 1st Argentinean Symposium on Artificial Intelligence (ASAI). 7--35.
  132. Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar.), 1--47.
  133. Seidenberg, J. and Rector, A. 2006. Web ontology segmentation: Analysis, classification and use. In Proceedings of the 15th International Conference on the World Wide Web (WWW). ACM, New York, NY, 13--22.
  134. Seki, K. and Mostafa, J. 2005. An application of text categorization methods to gene ontology annotation. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 138--145.
  135. Sen, P. and Getoor, L. 2007. Link-based classification. Tech. rep. CS-TR-4858. University of Maryland, College Park, MD.
  136. Shanks, V. and Williams, H. E. 2001. Fast categorisation of large document collections. In Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE). 194--204.
  137. Shen, D., Chen, Z., Yang, Q., Zeng, H.-J., Zhang, B., Lu, Y., and Ma, W.-Y. 2004. Web-page classification through summarization. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 242--249.
  138. Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. 2006. A comparison of implicit and explicit links for Web page classification. In Proceedings of the 15th International Conference on the World Wide Web. ACM Press, New York, NY, 643--650.
  139. Slattery, S. and Mitchell, T. M. 2000. Discovering test set regularities in relational domains. In Proceedings of the 17th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 895--902.
  140. Sun, A. and Lim, E.-P. 2001. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 521--528.
  141. Sun, A., Lim, E.-P., and Ng, W.-K. 2002. Web classification using support vector machine. In Proceedings of the 4th International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 96--99.
  142. Sun, A., Suryanto, M. A., and Liu, Y. 2007. Blog classification using tags: An empirical study. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. Lecture Notes in Computer Science, vol. 4822. Springer, Berlin, Germany, 307--316.
  143. Tan, A.-H. 1999. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD Workshop on Knowledge Discovery from Advanced Databases. 65--70.
  144. Tan, S. and Wang, Y. 2007. Combining error-correcting output codes and model-refinement for text categorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 699--700.
  145. Tian, Y., Huang, T., Gao, W., Cheng, J., and Kang, P. 2003. Two-phase Web site classification based on hidden Markov tree models. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 227.
  146. Tong, S. and Koller, D. 2001. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45--66.
  147. Utard, H. and Fürnkranz, J. 2005. Link-local features for hypertext classification. In Semantics, Web and Mining: Joint International Workshops, EWMF/KDO. Lecture Notes in Computer Science, vol. 4289. Springer, Berlin, Germany, 51--64.
  148. Veres, C. 2006. The language of folksonomies: What tags reveal about user classification. In Natural Language Processing and Information Systems. Lecture Notes in Computer Science, vol. 3999. Springer, Berlin/Heidelberg, Germany, 58--69.
  149. Wen, J.-R., Nie, J.-Y., and Zhang, H.-J. 2002. Query clustering using user logs. ACM Trans. Inform. Syst. 20, 1, 59--81.
  150. Wibowo, W. and Williams, H. E. 2002a. Simple and accurate feature selection for hierarchical categorisation. In Proceedings of the 2002 ACM Symposium on Document Engineering (DocEng). ACM Press, New York, NY, 111--118.
  151. Wibowo, W. and Williams, H. E. 2002b. Strategies for minimising errors in hierarchical Web categorisation. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 525--531.
  152. Wolpert, D. 1992. Stacked generalization. Neur. Netw. 5, 241--259.
  153. Xu, Z., King, I., and Lyu, M. R. 2007. Web page classification with heterogeneous data fusion. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1171--1172.
  154. Xue, G.-R., Yu, Y., Shen, D., Yang, Q., Zeng, H.-J., and Chen, Z. 2006. Reinforcing Web-object categorization through interrelationships. Data Min. Knowl. Disc. 12, 2-3, 229--248.
  155. Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., and Ma, W.-Y. 2005. OCFS: Optimal orthogonal centroid feature selection for text categorization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 122--129.
  156. Yang, H. and Chua, T.-S. 2004a. Effectiveness of Web page classification on finding list answers. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 522--523.
  157. Yang, H. and Chua, T.-S. 2004b. Web-based list question answering. In Proceedings of the 20th International Conference on Computational Linguistics (COLING). Association for Computational Linguistics, Morristown, NJ, 1277--1283.
  158. Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 412--420.
  159. Yang, Y., Slattery, S., and Ghani, R. 2002. A study of approaches to hypertext categorization. J. Intell. Inform. Syst. 18, 2-3, 219--241.
  160. Yang, Y., Zhang, J., and Kisiel, B. 2003. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 96--103.
  161. Yu, H., Han, J., and Chang, K. C.-C. 2004. PEBL: Web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16, 1, 70--81.
  162. Zaiane, O. R. and Strilets, A. 2002. Finding similar queries to satisfy searches based on query traces. In Proceedings of the International Workshop on Efficient Web-Based Information Systems (EWIS). Lecture Notes in Computer Science, vol. 2426. Springer, Berlin, Germany, 207--216.
  163. Zelikovitz, S. and Hirsh, H. 2001. Using LSI for text classification in the presence of background text. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 113--118.
  164. Zhang, D. and Lee, W. S. 2003. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 26--32.
  165. Zhang, T., Popescul, A., and Dom, B. 2006. Linear prediction models with graph regularization for Web-page categorization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, 821--826.
  166. Zhu, S., Ji, X., Xu, W., and Gong, Y. 2005. Multi-labelled classification using maximum entropy method. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 274--281.
  167. Zhu, S., Yu, K., Chi, Y., and Gong, Y. 2007. Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 487--494.
  168. zu Eissen, S. M. and Stein, B. 2004. Genre classification of Web pages. In Proceedings of the 27th German Conference on Artificial Intelligence. Lecture Notes in Computer Science, vol. 3238. Springer, Berlin, Germany, 256--269.


            Reviews

            Rafael Corchuelo

            To make it easier for people to find the pages they need, search engines crawl and index the Web. They build on keywords, which hinders their applicability to Web directories, advertisement placement, question answering, and other areas in which keywords are inherently ambiguous and do not help to discern among related and unrelated pages. Qi and Davison's paper provides a comprehensive survey of techniques that can be applied to build Web page classifiers. Such classifiers are mathematical devices trained to recognize pages about a topic. A topic does not need to be characterized by a set of keywords. More generally, a topic is characterized by a set of features that range from links that reference a page to the way the information is rendered on the screen.

            The paper consists of six sections. Section 1 is an introduction. Section 2 reports on a number of applications that would be greatly improved by using Web page classifiers. Section 3 surveys common features used to build Web page classifiers, and Section 4 reports on the algorithms that support them. Section 5 reports on a few miscellaneous issues, including preprocessing Web pages and gathering training datasets. Section 6 contains the conclusion.

            What makes this paper different from others is that it focuses on Web pages instead of text documents. A Web page is not a text document; a Web page contains information about how to render it, and a Web page has outgoing and ingoing links that provide valuable information for building classifiers. Similar papers in the literature do not take these features into account. In conclusion, I recommend this paper to researchers working in information retrieval. It is a valuable source on the state of the art of building Web page classifiers.

            Online Computing Reviews Service


            • Published in

              ACM Computing Surveys  Volume 41, Issue 2
              February 2009
              248 pages
              ISSN:0360-0300
              EISSN:1557-7341
              DOI:10.1145/1459352

              Copyright © 2009 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 23 February 2009
              • Accepted: 1 May 2008
              • Revised: 1 March 2008
              • Received: 1 July 2007


              Qualifiers

              • research-article
              • Research
              • Refereed
