Abstract
Classification of Web page content is essential to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification compared with traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process.
As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
- Aas, K. and Eikvil, L. 1999. Text categorisation: A survey. Tech. rep. 941. Norwegian Computing Center, Oslo, Norway.Google Scholar
- Agarwal, S. 2006. Ranking on graph data. In Proceedings of the 23rd International Conference on Machine Learning (ICML). ACM Press, New York, NY, 25--32. Google ScholarDigital Library
- Amitay, E. 1998. Using common hypertext links to identify the best phrasal description of target Web documents. In Proceedings of the SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web (Melbourne, Australia).Google Scholar
- Amitay, E., Carmel, D., Darlow, A., Lempel, R., and Soffer, A. 2003. The connectivity sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HYPERTEXT). ACM Press, New York, NY, 38--47. Google ScholarDigital Library
- Angelova, R. and Siersdorfer, S. 2006. A neighborhood-based approach for clustering of linked document collections. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 778--779. Google ScholarDigital Library
- Angelova, R. and Weikum, G. 2006. Graph-based text classification: Learn from your neighbors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 485--492. Google ScholarDigital Library
- Armstrong, R., Freitag, D., Joachims, T., and Mitchell, T. 1995. WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Distributed, Heterogeneous Environments. AAAI Press, Menlo Park, CA, 6--12.Google Scholar
- Asirvatham, A. P. and Ravi, K. K. 2001. Web page classification based on document structure. Awarded second prize in National Level Student Paper Contest conducted by IEEE India Council.Google Scholar
- Attardi, G., Gulli, A., and Sebastiani, F. 1999. Automatic Web page categorization by link and context analysis. In Proceedings of First European Symposium on Telematics, Hypermedia and Artificial Intelligence (THAI, Varese, Italy), C. Hutchison and G. Lanzarone, Eds., 105--119.Google Scholar
- Beeferman, D. and Berger, A. 2000. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 407--415. Google ScholarDigital Library
- Bennett, P. N., Dumais, S. T., and Horvitz, E. 2005. The combination of text classifiers using reliability indicators. Inform. Retriev. 8, 1, 67--100. Google ScholarDigital Library
- Berendt, B. and Hanser, C. 2007. Tags are not metadata, but “just more content”—to some people. In Proceedings of the International Conference on Weblogs and Social Media. 26--28.Google Scholar
- Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT). ACM Press, New York, NY, 92--100. Google ScholarDigital Library
- Broder, A., Fontoura, M., Josifovski, V., and Riedel, L. 2007a. A semantic approach to contextual advertising. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 559--566. Google ScholarDigital Library
- Broder, A. Z., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., and Zhang, T. 2007b. Robust classification of rare queries using Web knowledge. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 231--238. Google ScholarDigital Library
- Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML). 89--96. Google ScholarDigital Library
- Calado, P., Cristo, M., Moura, E., Ziviani, N., Ribeiro-Neto, B., and Goncalves, M. A. 2003. Combining link-based and content-based methods for Web document classification. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 394--401. Google ScholarDigital Library
- Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 186--193. Google ScholarDigital Library
- Cardoso-Cachopo, A. and Oliveira, A. L. 2003. An empirical comparison of text categorization methods. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 2857. Springer, Berlin, Germany, 183--196.Google Scholar
- Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the Web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 423--430. Google ScholarDigital Library
- Chakrabarti, S. 2000. Data mining for hypertext: A tutorial survey. SIGKDD Explorat. Newsl. 1, 2 (Jan.), 1--11. Google ScholarDigital Library
- Chakrabarti, S. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA. Google ScholarDigital Library
- Chakrabarti, S., Dom, B. E., and Indyk, P. 1998. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, 307--318. Google ScholarDigital Library
- Chakrabarti, S., Joshi, M. M., Punera, K., and Pennock, D. M. 2002. The structure of broad topics on the Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 251--262. Google ScholarDigital Library
- Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific Web resource discovery. In Proceeding of the 8th International Conference on World Wide Web (WWW). Elsevier, New York, NY, 1623--1640. Google ScholarDigital Library
- Chekuri, C., Goldwasser, M., Raghavan, P., and Upfal, E. 1997. Web search using automated classification. In Proceedings of the Sixth International World Wide Web Conference (Santa Clara, CA). Poster POS725.Google Scholar
- Chen, H. and Dumais, S. 2000. Bringing order to the Web: Automatically categorizing search results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press, New York, NY, 145--152. Google ScholarDigital Library
- Chen, Z., Wu, O., Zhu, M., and Hu, W. 2006. A novel Web page filtering system by combining texts and images. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 732--735. Google ScholarDigital Library
- Chesley, P., Vincent, B., Xu, L., and Srihari, R. K. 2006. Using verbs and adjectives to automatically classify blog sentiment. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 27--29. Technical Report SS-06-03.Google Scholar
- Chirita, P. A., Costache, S., Nejdl, W., and Handschuh, S. 2007. P-tag: Large scale automatic generation of personalized annotation tags for the Web. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 845--854. Google ScholarDigital Library
- Choi, B. and Yao, Z. 2005. Web page classification. In Foundations and Advances in Data Mining, W. Chu and T. Y. Lin, Eds. Studies in Fuzziness and Soft Computing, vol. 180. Springer-Verlag, Berlin, Germany, 221--274.Google Scholar
- Cohen, W. W. 2002. Improving a page classifier with anchor extraction and link analysis. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer, Eds. Vol. 15. MIT Press, Cambridge, MA, 1481--1488.Google Scholar
- Cohn, D. and Hofmann, T. 2001. The missing link—a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), vol. 13. MIT Press, Cambridge, MA.Google Scholar
- Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 1998. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 509--516. Google ScholarDigital Library
- Davidov, D., Gabrilovich, E., and Markovitch, S. 2004. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 250--257. Google ScholarDigital Library
- Davison, B. D. 2000. Topical locality in the Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 272--279. Google ScholarDigital Library
- Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology. Vol. 41. American Society for Information Science & Technology, Providence, RI, 393--402.Google ScholarCross Ref
- Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391--407.Google ScholarCross Ref
- Dietterich, T. G. and Bakiri, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artic. Intell. Res. 2, 263--286.Google ScholarCross Ref
- Doan, A., Madhavan, J., Domingos, P., and Halevy, A. 2002. Learning to map between ontologies on the semantic Web. In Proceedings of the 11th International Conference on World Wide Web (WWW). ACM, New York, NY, 662--673. Google ScholarDigital Library
- Drost, I., Bickel, S., and Scheffer, T. 2005. Discovering communities in linked data by multi-view clustering. In From Data and Information Analysis to Knowledge Engineering: Proceedings of 29th Annual Conference of the German Classification Society. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Germany, 342--349.Google Scholar
- Duda, R. O. and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York, NY.Google Scholar
- Dumais, S. and Chen, H. 2000. Hierarchical classification of Web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 256--263. Google ScholarDigital Library
- Elgersma, E. and de Rijke, M. 2006. Learning to recognize blogs: A preliminary exploration. In EACL Workshop: New Text—Wikis and blogs and other dynamic text sources.Google Scholar
- Ester, M., Kriegel, H.-P., and Schubert, M. 2002. Web site mining: A new way to spot competitors, customers and suppliers in the World Wide Web. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 249--258. Google ScholarDigital Library
- Fisher, M. J. and Everson, R. M. 2003. When are links useful? Experiments in text classification. In Advances in Information Retrieval. Proceedings of the 25th European Conference on IR Research. 41--56.Google Scholar
- Fitzpatrick, L. and Dent, M. 1997. Automatic feedback using past queries: Social searching? In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarDigital Library
- Fürnkranz, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of the 3rd Symposium on Intelligent Data Analysis (IDA-99), D. J. Hand, J. N. Kok, and M. R. Berthold, Eds. Lecture Notes in Computer Science, vol. 1642. Springer-Verlag, Amsterdam, The Netherlands, 487--497. Google ScholarDigital Library
- Fürnkranz, J. 2001. Hyperlink ensembles: A case study in hypertext classification. J. Inform. Fus. 1, 299--312.Google Scholar
- Fürnkranz, J. 2005. Web mining. In The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer, Berlin, Germany, 899--920.Google Scholar
- Gabrilovich, E. and Markovitch, S. 2004. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the 21st International Conference on Machine learning. ACM Press, New York, NY, 41. Google ScholarDigital Library
- Gabrilovich, E. and Markovitch, S. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference for Artificial Intelligence (IJCAI). 1048--1053.Google Scholar
- Gabrilovich, E. and Markovitch, S. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 1301--1306.Google Scholar
- Gabrilovich, E. and Markovitch, S. 2007. Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. J. Mach. Learn. Res. 8, 2297--2345. Google ScholarDigital Library
- Getoor, L. and Diehl, C. 2005. Link mining: A survey. SIGKDD Explorat. Newsl. (Special Issue on Link Mining) 7, 2 (Dec.), 3--12. Google ScholarDigital Library
- Ghani, R. 2001. Combining labeled and unlabeled data for text classification with a large number of categories. In First IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 597. Google ScholarDigital Library
- Ghani, R. 2002. Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of the 19th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 187--194. Google ScholarDigital Library
- Ghani, R., Slattery, S., and Yang, Y. 2001. Hypertext categorization using hyperlink patterns and meta data. In Proceedings of the 18th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 178--185. Google ScholarDigital Library
- Glance, N. S. 2000. Community search assistant. In Artificial Intelligence for Web Search. AAAI Press Mento Park, CA, 29--34. Presented at the AAAI-2000 Workshop on Artificial Intelligence for Web Search, Technical Rep. WS-00-01.Google Scholar
- Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., and Flake, G. W. 2002. Using Web structure for classifying and describing Web pages. In Proceedings of the 11th International Conference on World Wide Web. ACM Press, New York, NY, 562--569. Google ScholarDigital Library
- Golub, K. and Ardo, A. 2005. Importance of HTML structural elements and metadata in automated subject classification. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL). Lecture Notes in Computer Science, vol. 3652. Springer, Berlin, Germany, 368--378.Google Scholar
- Gövert, N., Lalmas, M., and Fuhr, N. 1999. A probabilistic description-oriented approach for categorizing Web documents. In Proceedings of the 8th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 475--482. Google ScholarDigital Library
- Gyöngyi, Z. and Garcia-Molina, H. 2005a. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB, Trondheim, Norway). 517--528. Google ScholarDigital Library
- Gyöngyi, Z. and Garcia-Molina, H. 2005b. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval (AIRWeb), B. D. Davison, Ed. Lehigh University, Department of Computer Science, Bethlehem, PA, 39--47. Technical rep. LU-CSE-05-030.Google Scholar
- Hammami, M., Chahir, Y., and Chen, L. 2003. Webguard: Web based adult content detection and filtering system. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 574. Google ScholarDigital Library
- Harabagiu, S. M., Pasca, M. A., and Maiorano, S. J. 2000. Experiments with open-domain textual question answering. In Proceedings of the 18th Conference on Computational Linguistics. Association for Computational Linguistics. Morristown, NJ, 292--298. Google ScholarDigital Library
- Haveliwala, T. H. 2003. Topic-sensitive PageRank: A context-sensitive ranking algorithm for Web search. IEEE Trans. Knowl. Data Eng. 15, 4, 784--796. Google ScholarDigital Library
- He, X., Zha, H., Ding, C. H. Q., and Simon, H. D. 2002. Web document clustering using hyperlink structures. Computat. Stat. Data Anal. 41, 1, 19--45.Google ScholarDigital Library
- Hermjakob, U. 2001. Parsing and question classification for question answering. In Proceedings of the ACL Workshop on Open-Domain Question Answering. 1--6. Google ScholarDigital Library
- Hofmann, T. 1999a. Probabilistic latent semantic analysis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI, Stockholm, Sweden). 289--296. Google ScholarDigital Library
- Hofmann, T. 1999b. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50--57. Google ScholarDigital Library
- Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004a. Liveclassifier: Creating hierarchical text classifiers through Web corpora. In Proceedings of the 13th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 184--192. Google ScholarDigital Library
- Huang, C.-C., Chuang, S.-L., and Chien, L.-F. 2004b. Using a Web-based categorization approach to generate thematic metadata from texts. ACM Trans. Asian Lang. Inform. Process. 3, 3, 190--212. Google ScholarDigital Library
- Jäschke, R., Marinho, L. B., Hotho, A., Schmidt-Thieme, L., and Stumme, G. 2007. Tag recommendations in folksonomies. In Proceedings of Knowledge Discovery in Databases: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), J. N. Kok, J. Koronacki, R. L. de Mntaras, S. Matwin, D. Mladenic, and A. Skowron, Eds. Lecture Notes in Computer Science, vol. 4702. Springer, Berlin, Germany, 506--514.Google Scholar
- Jensen, D., Neville, J., and Gallagher, B. 2004. Why collective inference improves relational classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 593--598. Google ScholarDigital Library
- Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 133--142. Google ScholarDigital Library
- Joachims, T., Cristianini, N., and Shawe-Taylor, J. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML), C. Brodley and A. Danyluk, Eds. Morgan Kaufmann, San Francisco, CA, 250--257. Google ScholarDigital Library
- Joachims, T., Freitag, D., and Mitchell, T. 1997. WebWatcher: A tour guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 770--775.Google Scholar
- Käki, M. 2005. Findex: Search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM Press, New York, NY, 131--140. Google ScholarDigital Library
- Kan, M.-Y. 2004. Web page classification without the Web page. In Proceedings of the 13th International World Wide Web Conference Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 262--263. Google ScholarDigital Library
- Kan, M.-Y. and Thi, H. O. N. 2005. Fast Webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 325--326. Google ScholarDigital Library
- Kiritchenko, S. 2005. Hierarchical text categorization and its application to bioinformatics. Ph.D. dissertation. University of Ottawa, Ottawa, Ont., Canada. Google ScholarDigital Library
- Klose, A. 2004. Extracting fuzzy classification rules from partially labeled data. Soft Comput. 8, 6, 417--427. Google ScholarDigital Library
- Kohlschutter, C., Chirita, P.-A., and Nejdl, W. 2007. Utility analysis for topically biased PageRank. In Proceedings of the 16th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 1211--1212. Google ScholarDigital Library
- Kosala, R. and Blockeel, H. 2000. Web mining research: A survey. SIGKDD Explorat. Newsl. 2, 1 (June), 1--15. Google ScholarDigital Library
- Kovacevic, M., Diligenti, M., Gori, M., and Milutinovic, V. 2004. Visual adjacency multigraphs—a novel approach for a Web page classification. In Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM). 38--49.Google Scholar
- Kuncheva, L. I. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, New York, NY. Google ScholarDigital Library
- Kurland, O. and Lee, L. 2005. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 306--313. Google ScholarDigital Library
- Kurland, O. and Lee, L. 2006. Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 83--90. Google ScholarDigital Library
- Kwok, C. C. T., Etzioni, O., and Weld, D. S. 2001. Scaling question answering to the Web. In Proceedings of the 10th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 150--161. Google ScholarDigital Library
- Kwon, O.-W. and Lee, J.-H. 2000. Web page classification based on k-nearest neighbor approach. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL). ACM Press, New York, NY, 9--15. Google ScholarDigital Library
- Kwon, O.-W. and Lee, J.-H. 2003. Text categorization based on k-nearest neighbor approach for Web site classification. Inform. Process. Manage. 29, 1 (Jan.), 25--44. Google ScholarDigital Library
- Leshed, G. and Kaye, J. J. 2006. Understanding how bloggers feel: Recognizing affect in blog posts. In CHI '06 Extended Abstracts on Human Factors in Computing Systems. ACM Press, New York, NY, 1019--1024. Google ScholarDigital Library
- Lindemann, C. and Littig, L. 2006. Coarse-grained classification of Web sites by their structural properties. In Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 35--42. Google ScholarDigital Library
- Liu, T.-Y., Yang, Y., Wan, H., Zeng, H.-J., Chen, Z., and Ma, W.-Y. 2005a. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorat. Newsl. 7, 1, 36--43. Google ScholarDigital Library
- Liu, W., Xue, G.-R., Yu, Y., and Zeng, H.-J. 2005b. Importance-based Web page classification using cost-sensitive SVM. Adv. Web-Age Inform. Manage. 3739, 127--137.Google ScholarDigital Library
- Loia, V. and Senatore, S. 2006a. Personalized knowledge models using RDF-based fuzzy classification. Stud. Fuzz. Soft Comput. 197, 45--64.Google ScholarCross Ref
- Loia, V. and Senatore, S. 2006b. Proximity-based supervision for flexible Web pages categorization. In Fuzzy Logic and the Semantic Web, Elsevier, The Netherlands, 46--69.Google Scholar
- Lu, Q. and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML). AAAI Press, Menlo Park, CA.Google Scholar
- Luxenburger, J. and Weikum, G. 2004. Query-log based authority analysis for Web information search. In Proceedings of the 5th International Conference on Web Information Systems Engineering (WISE). Lecture Notes in Computer Science, vol. 3306. Springer, Berlin, Germany, 90--101.Google Scholar
- Macskassy, S. A. and Provost, F. 2007. Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res. 8, 935--983. Google ScholarDigital Library
- Maguitman, A. G., Menczer, F., Roinestad, H., and Vespignani, A. 2005. Algorithmic detection of semantic similarity. In Proceedings of the 14th International Conference on the World Wide Web (WWW). ACM Press, New York, NY, 107--116. Google ScholarDigital Library
- Menczer, F. 2005. Mapping the semantics of Web text and links. IEEE Internet Comput. 9, 3 (May/June), 27--36. Google ScholarDigital Library
- Mihalcea, R. and Liu, H. 2006. A corpus-based approach to finding happiness. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 139--144. Tech. rep. SS-06-03.Google Scholar
- Mishne, G. 2005. Experiments with mood classification in blog posts. In Proceedings of the Workshop on Stylistic Analysis of Text for Information Access.Google Scholar
- Mishne, G. and de Rijke, M. 2006. Capturing global mood levels using blog posts. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 145--152. Tech. rep. SS-06-03.Google Scholar
- Mitchell, T. M. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarDigital Library
- Mladenic, D. 1998. Turning Yahoo into an automatic Web-page classifier. In Proceedings of the European Conference on Artificial Intelligence (ECAI). 473--474.Google Scholar
- Mladenic, D. 1999. Text-learning and related intelligent agents: A survey. IEEE Intell. Syst. Appl. 14, 4 (July/Aug.), 44--54. Google ScholarDigital Library
- Nagarajan, M., Sheth, A., Aguilera, M., Keeton, K., Merchant, A., and Uysal, M. 2007. Altering document term vectors for classification: Ontologies as expectations of co-occurrence. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1225--1226. Google ScholarDigital Library
- Nanno, T., Fujiki, T., Suzuki, Y., and Okumura, M. 2004. Automatically collecting, monitoring, and mining Japanese Weblogs. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (WWW Alt.). ACM Press, New York, NY, 320--321. Google ScholarDigital Library
- Netscape Communications Corporation. 2008. The dmoz Open Directory Project (ODP). http://www.dmoz.org/.Google Scholar
- Nie, L., Davison, B. D., and Qi, X. 2006. Topical link analysis for Web search. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 91--98. Google ScholarDigital Library
- NIST. 2007. Text REtrieval Conference (TREC). http://trec.nist.gov/.Google Scholar
- Nowson, S. 2006. The language of Weblogs: A study of genre and individual differences. Ph.D. dissertation, University of Edinburgh, College of Science and Engineering, Edinburgh, Scotland.Google Scholar
- Oh, H.-J., Myaeng, S. H., and Lee, M.-H. 2000. A practical hypertext catergorization method using links and incrementally available class information. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 264--271. Google ScholarDigital Library
- Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: Bringing order to the Web. Unpublished draft. Stanford University, Stanford, CA.Google Scholar
- Park, S.-B. and Zhang, B.-T. 2003. Large scale unstructured document classification using unlabeled data and syntactic information. In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Conference (PAKDD). Lecture Notes in Computer Science, vol. 2637. Springer, Berlin, Germany, 88--99.Google Scholar
- Patel, C., Supekar, K., Lee, Y., and Park, E. K. 2003. OntoKhoj: A semantic Web portal for ontology searching, ranking and classification. In Proceedings of the 5th ACM International Workshop on Web Information and Data Management (WIDM). ACM, New York, NY, 58--61. Google ScholarDigital Library
- Pazzani, M., Muramatsu, J., and Billsus, D. 1996. Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 54--61.Google Scholar
- Peng, X. and Choi, B. 2002. Automatic Web page classification in a dynamic and hierarchical way. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 386--393. Google ScholarDigital Library
- Pierre, J. M. 2001. On the automated classification of Web sites. Linköping Electron. Art. Comput. Inform. Sci. 6. http://www.ep.liu.se/ea/cis/2001/001/.Google Scholar
- Qi, X. and Davison, B. D. 2006. Knowing a Web page by the company it keeps. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 228--237. Google ScholarDigital Library
- Qu, H., Pietra, A. L., and Poon, S. 2006. Automated blog classification: Challenges and pitfalls. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 Spring Symposium, N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, Eds. AAAI Press, Menlo Park, CA, 184--186. Tech. rep. SS-06-03.Google Scholar
- Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceeding of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, 239--248. Google ScholarDigital Library
- Riboni, D. 2002. Feature selection for Web page classification. In Proceedings of the Workshop on Web Content Mapping: A Challenge to ICT (EURASIA-ICT).Google Scholar
- Richardson, M., Prakash, A., and Brill, E. 2006. Beyond Pagerank: Machine learning for static ranking. In Proceedings of the 15th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 707--715. Google ScholarDigital Library
- Rosenfeld, A., Hummel, R., and Zucker, S. 1976. Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybernet. 6, 420--433.Google ScholarCross Ref
- Roussinov, D. and Fan, W. 2005. Discretization based learning approach to information retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT). Association for Computational Linguistics, Morristown, NJ, 153--160. Google ScholarDigital Library
- Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York, NY. Google ScholarDigital Library
- Sebastiani, F. 1999. A tutorial on automated text categorisation. In Proceedings of the 1st Argentinean Symposium on Artificial Intelligence (ASAI). 7--35.Google Scholar
- Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar.), 1--47. Google ScholarDigital Library
- Seidenberg, J. and Rector, A. 2006. Web ontology segmentation: Analysis, classification and use. In Proceedings of the 15th International Conference on the World Wide Web (WWW). ACM, New York, NY, 13--22. Google ScholarDigital Library
- Seki, K. and Mostafa, J. 2005. An application of text categorization methods to gene ontology annotation. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 138--145. Google ScholarDigital Library
- Sen, P. and Getoor, L. 2007. Link-based classification. Tech. rep. CS-TR-4858. University of Maryland, College Park, MD.Google Scholar
- Shanks, V. and Williams, H. E. 2001. Fast categorisation of large document collections. In Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE). 194--204.Google Scholar
- Shen, D., Chen, Z., Yang, Q., Zeng, H.-J., Zhang, B., Lu, Y., and Ma, W.-Y. 2004. Web-page classification through summarization. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 242--249. Google ScholarDigital Library
- Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. 2006. A comparison of implicit and explicit links for Web page classification. In Proceedings of the 15th International Conference on the World Wide Web. ACM Press, New York, NY, 643--650. Google ScholarDigital Library
- Slattery, S. and Mitchell, T. M. 2000. Discovering test set regularities in relational domains. In Proceedings of the 17th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 895--902. Google ScholarDigital Library
- Sun, A. and Lim, E.-P. 2001. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, Los Alamitos, CA, 521--528. Google ScholarDigital Library
- Sun, A., Lim, E.-P., and Ng, W.-K. 2002. Web classification using support vector machine. In Proceedings of the 4th International Workshop on Web Information and Data Management (WIDM). ACM Press, New York, NY, 96--99. Google ScholarDigital Library
- Sun, A., Suryanto, M. A., and Liu, Y. 2007. Blog classification using tags: An empirical study. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. Lecture Notes in Computer Science, vol. 4822. Springer, Berlin, Germany, 307--316.Google Scholar
- Tan, A.-H. 1999. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD Workshop on Knowledge Discoverery from Advanced Databases. 65--70.Google Scholar
- Tan, S. and Wang, Y. 2007. Combining error-correcting output codes and model-refinement for text categorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 699--700. Google ScholarDigital Library
- Tian, Y., Huang, T., Gao, W., Cheng, J., and Kang, P. 2003. Two-phase Web site classification based on hidden Markov tree models. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI). IEEE Computer Society Press, Los Alamitos, CA, 227. Google ScholarDigital Library
- Tong, S. and Koller, D. 2001. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45--66. Google ScholarDigital Library
- Utard, H. and Fürnkranz, J. 2005. Link-local features for hypertext classification. In Semantics, Web and Mining: Joint International Workshops, EWMF/KDO. Lecture Notes in Computer Science, vol. 4289. Springer, Berlin, Germany, 51--64.Google Scholar
- Veres, C. 2006. The language of folksonomies: What tags reveal about user classification. In Natural Language Processing and Information Systems. Lecture Notes in Computer Science, vol. 3999. Springer, Berlin/Heidelberg, Germany, 58--69.Google Scholar
- Wen, J.-R., Nie, J.-Y., and Zhang, H.-J. 2002. Query clustering using user logs. ACM Trans. Inform. Syst. 20, 1, 59--81. Google ScholarDigital Library
- Wibowo, W. and Williams, H. E. 2002a. Simple and accurate feature selection for hierarchical categorisation. In Proceedings of the 2002 ACM Symposium on Document Engineering (DocEng). ACM Press, New York, NY, 111--118. Google ScholarDigital Library
- Wibowo, W. and Williams, H. E. 2002b. Strategies for minimising errors in hierarchical Web categorisation. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 525--531. Google ScholarDigital Library
- Wolpert, D. 1992. Stacked generalization. Neur. Netw. 5, 241--259. Google ScholarDigital Library
- Xu, Z., King, I., and Lyu, M. R. 2007. Web page classification with heterogeneous data fusion. In Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, New York, NY, 1171--1172. Google ScholarDigital Library
- Xue, G.-R., Yu, Y., Shen, D., Yang, Q., Zeng, H.-J., and Chen, Z. 2006. Reinforcing Web-object categorization through interrelationships. Data Min. Knowl. Disc. 12, 2-3, 229--248. Google ScholarDigital Library
- Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., and Ma, W.-Y. 2005. OCFS: Optimal orthogonal centroid feature selection for text categorization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 122--129. Google ScholarDigital Library
- Yang, H. and Chua, T.-S. 2004a. Effectiveness of Web page classification on finding list answers. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 522--523. Google ScholarDigital Library
- Yang, H. and Chua, T.-S. 2004b. Web-based list question answering. In Proceedings of the 20th International Conference on Computational Linguistics (COLING). Association for Computational Linguistics, Morristown, NJ, 1277--1283. Google ScholarDigital Library
- Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, 412--420. Google ScholarDigital Library
- Yang, Y., Slattery, S., and Ghani, R. 2002. A study of approaches to hypertext categorization. J. Intell. Inform. Syst. 18, 2-3, 219--241. Google ScholarDigital Library
- Yang, Y., Zhang, J., and Kisiel, B. 2003. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 96--103. Google ScholarDigital Library
- Yu, H., Han, J., and Chang, K. C.-C. 2004. PEBL: Web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16, 1, 70--81. Google ScholarDigital Library
- Zaiane, O. R. and Strilets, A. 2002. Finding similar queries to satisfy searches based on query traces. In Proceedings of the International Workshop on Efficient Web-Based Information Systems (EWIS). Lecture Notes in Computer Science, vol. 2426. Springer, Berlin, Germany, 207--216. Google ScholarDigital Library
- Zelikovitz, S. and Hirsh, H. 2001. Using LSI for text classification in the presence of background text. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 113--118. Google ScholarDigital Library
- Zhang, D. and Lee, W. S. 2003. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM Press, New York, NY, 26--32.Google Scholar
- Zhang, T., Popescul, A., and Dom, B. 2006. Linear prediction models with graph regularization for Web-page categorization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, 821--826. Google ScholarDigital Library
- Zhu, S., Ji, X., Xu, W., and Gong, Y. 2005. Multi-labelled classification using maximum entropy method. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 274--281. Google ScholarDigital Library
- Zhu, S., Yu, K., Chi, Y., and Gong, Y. 2007. Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 487--494. Google ScholarDigital Library
- zu Eissen, S. M. and Stein, B. 2004. Genre classification of Web pages. In Proceedings of the 27th German Conference on Artificial Intelligence. Lecture Notes in Computer Science, vol. 3238. Springer, Berlin, Germany, 256--269.Google Scholar
Index Terms
- Web page classification: Features and algorithms