Web page classification: Features and algorithms

Authors:
Xiaoguang Qi, Lehigh University, Bethlehem, PA
Brian D. Davison, Lehigh University, Bethlehem, PA

ACM Computing Surveys, Volume 41, Issue 2, Article No. 12, pp. 1–31. https://doi.org/10.1145/1459352.1459357

Published: 23 February 2009

Abstract

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process.

As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

References

Aas, K. and Eikvil, L. 1999. Text categorisation: A survey. Tech. rep. 941. Norwegian Computing Center, Oslo, Norway.Google Scholar
Agarwal, S. 2006. Ranking on graph data. In Proceedings of the 23rd International Conference on Machine Learning (ICML). ACM Press, New York, NY, 25--32. Google ScholarDigital Library
Amitay, E. 1998. Using common hypertext links to identify the best phrasal description of target Web documents. In Proceedings of the SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web (Melbourne, Australia).Google Scholar
Amitay, E., Carmel, D., Darlow, A., Lempel, R., and Soffer, A. 2003. The connectivity sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HYPERTEXT). ACM Press, New York, NY, 38--47. Google ScholarDigital Library
Angelova, R. and Siersdorfer, S. 2006. A neighborhood-based approach for clustering of linked document collections. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 778--779. Google ScholarDigital Library
Angelova, R. and Weikum, G. 2006. Graph-based text classification: Learn from your neighbors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 485--492. Google ScholarDigital Library
Armstrong, R., Freitag, D., Joachims, T., and Mitchell, T. 1995. WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Distributed, Heterogeneous Environments. AAAI Press, Menlo Park, CA, 6--12.Google Scholar
Asirvatham, A. P. and Ravi, K. K. 2001. Web page classification based on document structure. Awarded second prize in National Level Student Paper Contest conducted by IEEE India Council.Google Scholar
Attardi, G., Gulli, A., and Sebastiani, F. 1999. Automatic Web page categorization by link and context analysis. In Proceedings of First European Symposium on Telematics, Hypermedia and Artificial Intelligence (THAI, Varese, Italy), C. Hutchison and G. Lanzarone, Eds., 105--119.Google Scholar
Beeferman, D. and Berger, A. 2000. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 407--415. Google ScholarDigital Library
Bennett, P. N., Dumais, S. T., and Horvitz, E. 2005. The combination of text classifiers using reliability indicators. Inform. Retriev. 8, 1, 67--100. Google ScholarDigital Library
Berendt, B. and Hanser, C. 2007. Tags are not metadata, but “just more content”—to some people. In Proceedings of the International Conference on Weblogs and Social Media. 26--28.Google Scholar
Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT). ACM Press, New York, NY, 92--100. Google ScholarDigital Library
Broder, A., Fontoura, M., Josifovski, V., and Riedel, L. 2007a. A semantic approach to contextual advertising. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 559--566. Google ScholarDigital Library
Broder, A. Z., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., and Zhang, T. 2007b. Robust classification of rare queries using Web knowledge. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 231--238. Google ScholarDigital Library
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML). 89--96. Google ScholarDigital Library
Calado, P., Cristo, M., Moura, E., Ziviani, N., Ribeiro-Neto, B., and Goncalves, M. A. 2003. Combining link-based and content-based methods for Web document classification. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 394--401. Google ScholarDigital Library
Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 186--193. Google ScholarDigital Library
Cardoso-Cachopo, A. and Oliveira, A. L. 2003. An empirical comparison of text categorization methods. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 2857. Springer, Berlin, Germany, 183--196.Google Scholar
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the Web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 423--430. Google ScholarDigital Library
Chakrabarti, S. 2000. Data mining for hypertext: A tutorial survey. SIGKDD Explorat. Newsl. 1, 2 (Jan.), 1--11. Google ScholarDigital Library
Chakrabarti, S. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA. Google ScholarDigital Library
Chakrabarti, S., Dom, B. E., and Indyk, P. 1998. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, 307--318. Google ScholarDigital Library
Chakrabarti, S., Joshi, M. M., Punera, K., and Pennock, D. M. 2002. The structure of broad topics on the Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 251--262. Google ScholarDigital Library
Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific Web resource discovery. In Proceeding of the 8th International Conference on World Wide Web (WWW). Elsevier, New York, NY, 1623--1640. Google ScholarDigital Library
Chekuri, C., Goldwasser, M., Raghavan, P., and Upfal, E. 1997. Web search using automated classification. In Proceedings of the Sixth International World Wide Web Conference (Santa Clara, CA). Poster POS725.Google Scholar
Chen, H. and Dumais, S. 2000. Bringing order to the Web: Automatically categorizing search results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press, New York, NY, 145--152. Google ScholarDigital Library
Chen, Z., Wu, O., Zhu, M., and Hu, W. 2006. A novel Web page filtering system by combining texts and images. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 732--735. Google ScholarDigital Library
Chesley, P., Vincent, B., Xu, L., and Srihari, R. K. 2006. Using verbs and adjectives to automatically classify blog sentiment. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 27--29. Technical Report SS-06-03.Google Scholar
Chirita, P. A., Costache, S., Nejdl, W., and Handschuh, S. 2007. P-tag: Large scale automatic generation of personalized annotation tags for the Web. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 845--854. Google ScholarDigital Library
Choi, B. and Yao, Z. 2005. Web page classification. In Foundations and Advances in Data Mining, W. Chu and T. Y. Lin, Eds. Studies in Fuzziness and Soft Computing, vol. 180. Springer-Verlag, Berlin, Germany, 221--274.Google Scholar
Cohen, W. W. 2002. Improving a page classifier with anchor extraction and link analysis. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer, Eds. Vol. 15. MIT Press, Cambridge, MA, 1481--1488.Google Scholar
Cohn, D. and Hofmann, T. 2001. The missing link—a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), vol. 13. MIT Press, Cambridge, MA.Google Scholar
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 1998. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 509--516. Google ScholarDigital Library
Davidov, D., Gabrilovich, E., and Markovitch, S. 2004. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 250--257. Google ScholarDigital Library
Davison, B. D. 2000. Topical locality in the Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 272--279. Google ScholarDigital Library
Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology. Vol. 41. American Society for Information Science & Technology, Providence, RI, 393--402.Google ScholarCross Ref
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391--407.Google ScholarCross Ref
Dietterich, T. G. and Bakiri, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artic. Intell. Res. 2, 263--286.Google ScholarCross Ref
Doan, A., Madhavan, J., Domingos, P., and Halevy, A. 2002. Learning to map between ontologies on the semantic Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM, New York, NY, 662--673. Google ScholarDigital Library
Drost, I., Bickel, S., and Scheffer, T. 2005. Discovering communities in linked data by multi-view clustering. In From Data and Information Analysis to Knowledge Engineering: Proceedings of 29th Annual Conference of the German Classification Society. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Germany, 342--349.Google Scholar
Duda, R. O. and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York, NY.Google Scholar
Dumais, S. and Chen, H. 2000. Hierarchical classification of Web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 256--263. Google ScholarDigital Library
Elgersma, E. and de Rijke, M. 2006. Learning to recognize blogs: A preliminary exploration. In EACL Workshop: New Text—Wikis and blogs and other dynamic text sources.Google Scholar
Ester, M., Kriegel, H.-P., and Schubert, M. 2002. Web site mining: A new way to spot competitors, customers and suppliers in the World Wide Web. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 249--258. Google ScholarDigital Library
Fisher, M. J. and Everson, R. M. 2003. When are links useful&quest; Experiments in text classification. In Advances in Information Retrieval. Proceedings of the 25th European Conference on IR Research. 41--56.Google Scholar
Fitzpatrick, L. and Dent, M. 1997. Automatic feedback using past queries: Social searching&quest; In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarDigital Library
Fürnkranz, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of the 3rd Symposium on Intelligent Data Analysis (IDA-99), D. J. Hand, J. N. Kok, and M. R. Berthold, Eds. Lecture Notes in Computer Science, vol. 1642. Springer-Verlag, Amsterdam, The Netherlands, 487--497. Google ScholarDigital Library
Fürnkranz, J. 2001. Hyperlink ensembles: A case study in hypertext classification. J. Inform. Fus. 1, 299--312.Google Scholar
Fürnkranz, J. 2005. Web mining. In The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer, Berlin, Germany, 899--920.Google Scholar
Gabrilovich, E. and Markovitch, S. 2004. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the 21st International Conference on Machine learning. ACM Press, New York, NY, 41. Google ScholarDigital Library
Gabrilovich, E. and Markovitch, S. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference for Artificial Intelligence (IJCAI). 1048--1053.Google Scholar
Gabrilovich, E. and Markovitch, S. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 1301--1306.Google Scholar
Gabrilovich, E. and Markovitch, S. 2007. Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. J. Mach. Learn. Res. 8, 2297--2345. Google ScholarDigital Library
Getoor, L. and Diehl, C. 2005. Link mining: A survey. SIGKDD Explorat. Newsl. (Special Issue on Link Mining) 7, 2 (Dec.), 3--12. Google ScholarDigital Library
Ghani, R. 2001. Combining labeled and unlabeled data for text classification with a large number of categories. In First IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 597. Google ScholarDigital Library
Ghani, R. 2002. Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of the 19th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 187--194. Google ScholarDigital Library
Ghani, R., Slattery, S., and Yang, Y. 2001. Hypertext categorization using hyperlink patterns and meta data. In Proceedings of the 18th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 178--185. Google ScholarDigital Library
Glance, N. S. 2000. Community search assistant. In Artificial Intelligence for Web Search. AAAI Press Mento Park, CA, 29--34. Presented at the AAAI-2000 Workshop on Artificial Intelligence for Web Search, Technical Rep. WS-00-01.Google Scholar
Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., and Flake, G. W. 2002. Using Web structure for classifying and describing Web pages. In Proceedings of the 11th International Conference on World Wide Web. ACM Press, New York, NY, 562--569. Google ScholarDigital Library
Golub, K. and Ardo, A. 2005. Importance of HTML structural elements and metadata in automated subject classification. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL). Lecture Notes in Computer Science, vol. 3652. Springer, Berlin, Germany, 368--378.Google Scholar
Gövert, N., Lalmas, M., and Fuhr, N. 1999. A probabilistic description-oriented approach for categorizing Web documents. In Proceedings of the 8th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 475--482. Google ScholarDigital Library
Gyöngyi, Z. and Garcia-Molina, H. 2005a. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB, Trondheim, Norway). 517--528. Google ScholarDigital Library
Gyöngyi, Z. and Garcia-Molina, H. 2005b. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval (AIRWeb), B. D. Davison, Ed. Lehigh University, Department of Computer Science, Bethlehem, PA, 39--47. Technical rep. LU-CSE-05-030.Google Scholar
Hammami, M., Chahir, Y., and Chen, L. 2003. Webguard: Web based adult content detection and filtering system. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 574. Google ScholarDigital Library
Harabagiu, S. M., Pasca, M. A., and Maiorano, S. J. 2000. Experiments with open-domain textual question answering. In Proceedings of the 18th Conference on Computational Linguistics. Association for Computational Linguistics. Morristown, NJ, 292--298. Google ScholarDigital Library
Haveliwala, T. H. 2003. Topic-sensitive PageRank: A context-sensitive ranking algorithm for Web search. IEEE Trans. Knowl. Data Eng. 15, 4, 784--796. Google ScholarDigital Library
He, X., Zha, H., Ding, C. H. Q., and Simon, H. D. 2002. Web document clustering using hyperlink structures. Computat. Stat. Data Anal. 41, 1, 19--45.Google ScholarDigital Library
Hermjakob, U. 2001. Parsing and question classification for question answering. In Proceedings of the ACL Workshop on Open-Domain Question Answering. 1--6. Google ScholarDigital Library
Hofmann, T. 1999a. Probabilistic latent semantic analysis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI, Stockholm, Sweden). 289--296. Google ScholarDigital Library
Hofmann, T. 1999b. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50--57. Google ScholarDigital Library
Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004a. Liveclassifier: Creating hierarchical text classifiers through Web corpora. In Proceedings of the 13th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 184--192. Google ScholarDigital Library
Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004b. Using a Web-based categorization approach to generate thematic metadata from texts. ACM Trans. Asian Lang. Inform. Process. 3, 3, 190--212. Google ScholarDigital Library
Jäschke, R., Marinho, L. B., Hotho, A., Schmidt-Thieme, L., and Stumme, G. 2007. Tag recommendations in folksonomies. In Proceedings of Knowledge Discovery in Databases: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), J. N. Kok, J. Koronacki, R. L. de Mntaras, S. Matwin, D. Mladenic, and A. Skowron, Eds. Lecture Notes in Computer Science, vol. 4702. Springer, Berlin, Germany, 506--514.Google Scholar
Jensen, D., Neville, J., and Gallagher, B. 2004. Why collective inference improves relational classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 593--598. Google ScholarDigital Library
Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 133--142. Google ScholarDigital Library
Joachims, T., Cristianini, N., and Shawe-Taylor, J. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML), C. Brodley and A. Danyluk, Eds. Morgan Kaufmann, San Francisco, CA, 250--257. Google ScholarDigital Library
Joachims, T., Freitag, D., and Mitchell, T. 1997. WebWatcher: A tour guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 770--775.Google Scholar
Käki, M. 2005. Findex: Search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM Press, New York, NY, 131--140. Google ScholarDigital Library
Kan, M.-Y. 2004. Web page classification without the Web page. In Proceedings of the 13th International World Wide Web Conference Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 262--263. Google ScholarDigital Library
Kan, M.-Y. and Thi, H. O. N. 2005. Fast Webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 325--326. Google ScholarDigital Library
Kiritchenko, S. 2005. Hierarchical text categorization and its application to bioinformatics. Ph.D. dissertation. University of Ottawa, Ottawa, Ont., Canada. Google ScholarDigital Library
Klose, A. 2004. Extracting fuzzy classification rules from partially labeled data. Soft Comput. 8, 6, 417--427. Google ScholarDigital Library
Kohlschutter, C., Chirita, P.-A., and Nejdl, W. 2007. Utility analysis for topically biased PageRank. In Proceedings of the 16th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 1211--1212. Google ScholarDigital Library
Kosala, R. and Blockeel, H. 2000. Web mining research: A survey. SIGKDD Explorat. Newsl. 2, 1 (June), 1--15. Google ScholarDigital Library
Kovacevic, M., Diligenti, M., Gori, M., and Milutinovic, V. 2004. Visual adjacency multigraphs—a novel approach for a Web page classification. In Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM). 38--49.Google Scholar
Kuncheva, L. I. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, New York, NY. Google ScholarDigital Library
Kurland, O. and Lee, L. 2005. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarDigital Library
Kurland, O. and Lee, L. 2006. Respect my authority&excl;: HITS without hyperlinks, utilizing cluster-based language models. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 83--90. Google ScholarDigital Library
Kwok, C. C. T., Etzioni, O., and Weld, D. S. 2001. Scaling question answering to the Web. In Proceedings of the 10th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 150--161. Google ScholarDigital Library
Kwon, O.-W. and Lee, J.-H. 2000. Web page classification based on k-nearest neighbor approach. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL). ACM Press, New York, NY, 9--15. Google ScholarDigital Library
Kwon, O.-W. and Lee, J.-H. 2003. Text categorization based on k-nearest neighbor approach for Web site classification. Inform. Process. Manage. 29, 1 (Jan.), 25--44. Google ScholarDigital Library
Leshed, G. and Kaye, J. J. 2006. Understanding how bloggers feel: Recognizing affect in blog posts. In CHI '06 Extended Abstracts on Human Factors in Computing Systems. ACM Press, New York, NY, 1019--1024. Google ScholarDigital Library
Lindemann, C. and Littig, L. 2006. Coarse-grained classification of Web sites by their structural properties. In Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 35--42. Google ScholarDigital Library
Liu, T.-Y., Yang, Y., Wan, H., Zeng, H.-J., Chen, Z., and Ma, W.-Y. 2005a. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorat. Newsl. 7, 1, 36--43. Google ScholarDigital Library
Liu, W., Xue, G.-R., Yu, Y., and Zeng, H.-J. 2005b. Importance-based Web page classification using cost-sensitive SVM. Adv. Web-Age Inform. Manage. 3739, 127--137.Google ScholarDigital Library
Loia, V. and Senatore, S. 2006a. Personalized knowledge models using RDF-based fuzzy classification. Stud. Fuzz. Soft Comput. 197, 45--64.Google ScholarCross Ref
Loia, V. and Senatore, S. 2006b. Proximity-based supervision for flexible Web pages categorization. In Fuzzy Logic and the Semantic Web, Elsevier, The Netherlands, 46--69.Google Scholar
Lu, Q. and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML). AAAI Press, Menlo Park, CA.Google Scholar
Luxenburger, J. and Weikum, G. 2004. Query-log based authority analysis for Web information search. In Proceedings of the 5th International Conference on Web Information Systems Engineering (WISE). Lecture Notes in Computer Science, vol. 3306. Springer, Berlin, Germany, 90--101.Google Scholar
Macskassy, S. A. and Provost, F. 2007. Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res. 8, 935--983. Google ScholarDigital Library
Maguitman, A. G., Menczer, F., Roinestad, H., and Vespignani, A. 2005. Algorithmic detection of semantic similarity. In Proceedings of the 14th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 107--116. Google ScholarDigital Library
Menczer, F. 2005. Mapping the semantics of Web text and links. IEEE Internet Comput. 9, 3 (May/June), 27--36. Google ScholarDigital Library
Mihalcea, R. and Liu, H. 2006. A corpus-based approach to finding happiness. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 139--144. Tech. rep. SS-06-03.Google Scholar
Mishne, G. 2005. Experiments with mood classification in blog posts. In Proceedings of the Workshop on Stylistic Analysis of Text for Information Access.Google Scholar
Mishne, G. and de Rijke, M. 2006. Capturing global mood levels using blog posts. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 145--152. Tech. rep. SS-06-03.Google Scholar
Mitchell, T. M. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarDigital Library
Mladenic, D. 1998. Turning Yahoo into an automatic Web-page classifier. In Proceedings of the European Conference on Artificial Intelligence (ECAI). 473--474.Google Scholar
Mladenic, D. 1999. Text-learning and related intelligent agents: A survey. IEEE Intell. Syst. Appl. 14, 4 (July/Aug.), 44--54. Google ScholarDigital Library
Nagarajan, M., Sheth, A., Aguilera, M., Keeton, K., Merchant, A., and Uysal, M. 2007. Altering document term vectors for classification: Ontologies as expectations of co-occurrence. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1225--1226. Google ScholarDigital Library
Nanno, T., Fujiki, T., Suzuki, Y., and Okumura, M. 2004. Automatically collecting, monitoring, and mining Japanese Weblogs. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 320--321. Google ScholarDigital Library
Netscape Communications Corporation. 2008. The dmoz Open Directory Project (ODP). http://www.dmoz.org/.Google Scholar
Nie, L., Davison, B. D., and Qi, X. 2006. Topical link analysis for Web search. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 91--98. Google ScholarDigital Library
NIST. 2007. Text REtrieval Conference (TREC). http://trec.nist.gov/.Google Scholar
Nowson, S. 2006. The language of Weblogs: A study of genre and individual differences. Ph.D. dissertation, University of Edinburgh, College of Science and Engineering, Edinburgh, Scotland.Google Scholar
Oh, H.-J., Myaeng, S. H., and Lee, M.-H. 2000. A practical hypertext catergorization method using links and incrementally available class information. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 264--271. Google ScholarDigital Library
Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: Bringing order to the Web. Unpublished draft. Stanford University, Stanford, CA.Google Scholar
Park, S.-B. and Zhang, B.-T. 2003. Large scale unstructured document classification using unlabeled data and syntactic information. In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Conference (PAKDD). Lecture Notes in Computer Science, vol. 2637. Springer, Berlin, Germany, 88--99.Google Scholar
Patel, C., Supekar, K., Lee, Y., and Park, E. K. 2003. OntoKhoj: A semantic Web portal for ontology searching, ranking and classification. In Proceedings of the 5th ACM International Workshop on Web Information and Data Management (WIDM). ACM, New York, NY, 58--61. Google ScholarDigital Library
Pazzani, M., Muramatsu, J., and Billsus, D. 1996. Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 54--61.Google Scholar
Peng, X. and Choi, B. 2002. Automatic Web page classification in a dynamic and hierarchical way. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 386--393. Google ScholarDigital Library
Pierre, J. M. 2001. On the automated classification of Web sites. Linköping Electron. Art. Comput. Inform. Sci. 6. http://www.ep.liu.se/ea/cis/2001/001/.Google Scholar
Qi, X. and Davison, B. D. 2006. Knowing a Web page by the company it keeps. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 228--237. Google ScholarDigital Library
Qu, H., Pietra, A. L., and Poon, S. 2006. Automated blog classification: Challenges and pitfalls. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 184--186. Tech. rep. SS-06-03.Google Scholar
Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceeding of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 239--248. Google ScholarDigital Library
Riboni, D. 2002. Feature selection for Web page classification. In Proceedings of the Workshop on Web Content Mapping: A Challenge to ICT (EURASIA-ICT).
Richardson, M., Prakash, A., and Brill, E. 2006. Beyond Pagerank: Machine learning for static ranking. In Proceedings of the 15th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 707--715.
Rosenfeld, A., Hummel, R., and Zucker, S. 1976. Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybernet. 6, 420--433.
Roussinov, D. and Fan, W. 2005. Discretization based learning approach to information retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT). Association for Computational Linguistics, Morristown, NJ, 153--160.
Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York, NY.
Sebastiani, F. 1999. A tutorial on automated text categorisation. In Proceedings of the 1st Argentinean Symposium on Artificial Intelligence (ASAI). 7--35.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar.), 1--47.
Seidenberg, J. and Rector, A. 2006. Web ontology segmentation: Analysis, classification and use. In Proceedings of the 15th International Conference on the World Wide Web (WWW). ACM, New York, NY, 13--22.
Seki, K. and Mostafa, J. 2005. An application of text categorization methods to gene ontology annotation. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 138--145.
Sen, P. and Getoor, L. 2007. Link-based classification. Tech. rep. CS-TR-4858. University of Maryland, College Park, MD.
Shanks, V. and Williams, H. E. 2001. Fast categorisation of large document collections. In Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE). 194--204.
Shen, D., Chen, Z., Yang, Q., Zeng, H.-J., Zhang, B., Lu, Y., and Ma, W.-Y. 2004. Web-page classification through summarization. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 242--249.
Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. 2006. A comparison of implicit and explicit links for Web page classification. In Proceedings of the 15th International Conference on the World Wide Web. ACM Press, New York, NY, 643--650.
Slattery, S. and Mitchell, T. M. 2000. Discovering test set regularities in relational domains. In Proceedings of the 17th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 895--902.
Sun, A. and Lim, E.-P. 2001. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 521--528.
Sun, A., Lim, E.-P., and Ng, W.-K. 2002. Web classification using support vector machine. In Proceedings of the 4th International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 96--99.
Sun, A., Suryanto, M. A., and Liu, Y. 2007. Blog classification using tags: An empirical study. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. Lecture Notes in Computer Science, vol. 4822. Springer, Berlin, Germany, 307--316.
Tan, A.-H. 1999. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD Workshop on Knowledge Discovery from Advanced Databases. 65--70.
Tan, S. and Wang, Y. 2007. Combining error-correcting output codes and model-refinement for text categorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 699--700.
Tian, Y., Huang, T., Gao, W., Cheng, J., and Kang, P. 2003. Two-phase Web site classification based on hidden Markov tree models. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 227.
Tong, S. and Koller, D. 2001. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45--66.
Utard, H. and Fürnkranz, J. 2005. Link-local features for hypertext classification. In Semantics, Web and Mining: Joint International Workshops, EWMF/KDO. Lecture Notes in Computer Science, vol. 4289. Springer, Berlin, Germany, 51--64.
Veres, C. 2006. The language of folksonomies: What tags reveal about user classification. In Natural Language Processing and Information Systems. Lecture Notes in Computer Science, vol. 3999. Springer, Berlin/Heidelberg, Germany, 58--69.
Wen, J.-R., Nie, J.-Y., and Zhang, H.-J. 2002. Query clustering using user logs. ACM Trans. Inform. Syst. 20, 1, 59--81.
Wibowo, W. and Williams, H. E. 2002a. Simple and accurate feature selection for hierarchical categorisation. In Proceedings of the 2002 ACM Symposium on Document Engineering (DocEng). ACM Press, New York, NY, 111--118.
Wibowo, W. and Williams, H. E. 2002b. Strategies for minimising errors in hierarchical Web categorisation. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 525--531.
Wolpert, D. 1992. Stacked generalization. Neur. Netw. 5, 241--259.
Xu, Z., King, I., and Lyu, M. R. 2007. Web page classification with heterogeneous data fusion. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1171--1172.
Xue, G.-R., Yu, Y., Shen, D., Yang, Q., Zeng, H.-J., and Chen, Z. 2006. Reinforcing Web-object categorization through interrelationships. Data Min. Knowl. Disc. 12, 2-3, 229--248.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., and Ma, W.-Y. 2005. OCFS: Optimal orthogonal centroid feature selection for text categorization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 122--129.
Yang, H. and Chua, T.-S. 2004a. Effectiveness of Web page classification on finding list answers. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 522--523.
Yang, H. and Chua, T.-S. 2004b. Web-based list question answering. In Proceedings of the 20th International Conference on Computational Linguistics (COLING). Association for Computational Linguistics, Morristown, NJ, 1277--1283.
Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 412--420.
Yang, Y., Slattery, S., and Ghani, R. 2002. A study of approaches to hypertext categorization. J. Intell. Inform. Syst. 18, 2-3, 219--241.
Yang, Y., Zhang, J., and Kisiel, B. 2003. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 96--103.
Yu, H., Han, J., and Chang, K. C.-C. 2004. PEBL: Web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16, 1, 70--81.
Zaiane, O. R. and Strilets, A. 2002. Finding similar queries to satisfy searches based on query traces. In Proceedings of the International Workshop on Efficient Web-Based Information Systems (EWIS). Lecture Notes in Computer Science, vol. 2426. Springer, Berlin, Germany, 207--216.
Zelikovitz, S. and Hirsh, H. 2001. Using LSI for text classification in the presence of background text. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 113--118.
Zhang, D. and Lee, W. S. 2003. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 26--32.
Zhang, T., Popescul, A., and Dom, B. 2006. Linear prediction models with graph regularization for Web-page categorization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, 821--826.
Zhu, S., Ji, X., Xu, W., and Gong, Y. 2005. Multi-labelled classification using maximum entropy method. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 274--281.
Zhu, S., Yu, K., Chi, Y., and Gong, Y. 2007. Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 487--494.
zu Eissen, S. M. and Stein, B. 2004. Genre classification of Web pages. In Proceedings of the 27th German Conference on Artificial Intelligence. Lecture Notes in Computer Science, vol. 3238. Springer, Berlin, Germany, 256--269.

Reviews

Reviewer: Rafael Corchuelo

To make it easier for people to find the pages they need, search engines crawl and index the Web. They build on keywords, which hinders their applicability to Web directories, advertisement placement, question answering, and other areas in which keywords are inherently ambiguous and do not help to discern among related and unrelated pages.

Qi and Davison's paper provides a comprehensive survey of techniques that can be applied to build Web page classifiers. Such classifiers are mathematical devices trained to recognize pages about a topic. A topic does not need to be characterized by a set of keywords. More generally, a topic is characterized by a set of features that range from links that reference a page to the way the information is rendered on the screen.

The paper consists of six sections. Section 1 is an introduction. Section 2 reports on a number of applications that would be greatly improved by using Web page classifiers. Section 3 surveys common features used to build Web page classifiers, and Section 4 reports on the algorithms that support them. Section 5 reports on a few miscellaneous issues, including preprocessing Web pages and gathering training datasets. Section 6 contains the conclusion.

What makes this paper different from others is that it focuses on Web pages instead of text documents. A Web page is not a text document; a Web page contains information about how to render it, and a Web page has outgoing and ingoing links that provide valuable information for building classifiers. Similar papers in the literature do not take these features into account.

In conclusion, I recommend this paper to researchers working in information retrieval. It is a valuable source on the state of the art of building Web page classifiers.

Online Computing Reviews Service

Published in

ACM Computing Surveys Volume 41, Issue 2
February 2009
248 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/1459352

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2009
- Accepted: 1 May 2008
- Revised: 1 March 2008
- Received: 1 July 2007
Author Tags
Categorization
Web mining
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 282
- Total Downloads: 10,272
- Downloads (Last 12 months): 118
- Downloads (Last 6 weeks): 9