ABSTRACT
Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages so that users can manage and use the huge amount of information available on the web. As a result, user-friendly, automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk materials, topic identification, and structured search are among the active areas in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider web content outliers: documents whose contents vary from those of the other documents taken from the same domain (category). In this paper, we take advantage of the HTML structure of the web and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce processing time, the optimized algorithm uses only data captured in <Meta> and <Title> tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in <Meta> and <Title> tags gave the same results as using text embedded in <Meta>, <Title>, and <Body> tags.
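The idea of scoring pages by n-gram partial matching over <Meta> and <Title> text can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the weighting or the scoring function, so the sketch assumes character trigrams, a hand-picked domain vocabulary, and a simple "fraction of unmatched n-grams" score. All function and variable names (`head_text`, `char_ngrams`, `dissimilarity`, `rank_outliers`) are hypothetical.

```python
import re
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):
    """Collects <title> text and the content attribute of <meta> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = dict(attrs).get("content")
            if content:
                self.parts.append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Body text is ignored on purpose: only title text is kept here.
        if self._in_title:
            self.parts.append(data)


def head_text(html):
    """Return the lowercased text found in <title> and <meta content=...>."""
    parser = HeadTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower()


def char_ngrams(text, n=3):
    """Return the set of character n-grams of each word, padded with '_'."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def dissimilarity(doc_grams, profile_grams):
    """Fraction of the document's n-grams missing from the domain profile.

    Assumed scoring rule for illustration only; 1.0 means no overlap.
    """
    if not doc_grams:
        return 1.0
    return 1.0 - len(doc_grams & profile_grams) / len(doc_grams)


def rank_outliers(pages, profile_grams, n=3):
    """Sort pages from most to least dissimilar to the domain profile."""
    scores = {
        name: dissimilarity(char_ngrams(head_text(html), n), profile_grams)
        for name, html in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

To use it, build a profile with `char_ngrams` from a domain vocabulary, then pass a mapping of page names to HTML strings to `rank_outliers`; pages at the top of the ranking are the outlier candidates. Because matching is done on n-grams rather than whole words, inflected forms such as "engines" still match most of the n-grams of "engine".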
Index Terms
- Mining web content outliers using structure oriented weighting techniques and N-grams