DOI: 10.1145/1066677.1066788
Article

Mining web content outliers using structure oriented weighting techniques and N-grams

Published: 13 March 2005

ABSTRACT

Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages so that users can manage and use the huge amount of information available on the web. As a result, user-friendly, automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk materials, topic identification, and structured search are some of the active areas in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider web content outliers: documents whose contents vary from those of the other documents taken from the same domain (category). In this paper, we take advantage of the HTML structure of web pages and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce processing time, the optimized algorithm uses only data captured in <Meta> and <Title> tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in <Meta> and <Title> tags gave the same results as using text embedded in <Meta>, <Title>, and <Body> tags.
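
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that text is taken from the title and from keyword/description meta tags, that partial matching is done with character 5-grams, and that a page's dissimilarity is one minus its average Jaccard similarity to the other pages in the category. All function names, the value n=5, and the scoring rule are illustrative assumptions.

```python
from html.parser import HTMLParser


class TitleMetaExtractor(HTMLParser):
    """Collect text from <title> and from the content of selected <meta> tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            # Assumption: only keywords/description metadata carry topical text.
            if (attrs.get("name") or "").lower() in {"keywords", "description"}:
                self.parts.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def title_meta_text(html):
    """Return the lowercased title and meta text of one HTML page."""
    parser = TitleMetaExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower()


def char_ngrams(text, n=5):
    """Return the set of character n-grams per word (short words kept whole)."""
    grams = set()
    for word in text.split():
        word = word.strip(".,;:!?()\"'")
        if not word:
            continue
        if len(word) <= n:
            grams.add(word)
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_outliers(pages, n=5):
    """Return (dissimilarity, page_index) pairs, most dissimilar first."""
    profiles = [char_ngrams(title_meta_text(p), n) for p in pages]
    scores = []
    for i, profile in enumerate(profiles):
        sims = [jaccard(profile, q) for j, q in enumerate(profiles) if j != i]
        avg = sum(sims) / len(sims) if sims else 0.0
        scores.append((1.0 - avg, i))
    return sorted(scores, reverse=True)
```

Because n-grams are taken per word, "database" and "databases" share most of their 5-grams, which is the kind of partial string matching the abstract refers to. Calling `rank_outliers` on a list of HTML strings from one category puts the pages whose title and meta text overlap least with the rest at the front of the result.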


Published in

SAC '05: Proceedings of the 2005 ACM symposium on Applied computing
March 2005
1814 pages
ISBN: 1581139640
DOI: 10.1145/1066677

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
