Article

Mining web content outliers using structure oriented weighting techniques and N-grams

Authors:
Malik Agyemang, University of Calgary, AB, Canada
Ken Barker, University of Calgary, AB, Canada
Rada S. Alhajj, University of Calgary, AB, Canada

SAC '05: Proceedings of the 2005 ACM symposium on Applied computing, March 2005, Pages 482–487, https://doi.org/10.1145/1066677.1066788

Published: 13 March 2005

ABSTRACT

Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages so that users can manage and use the huge amount of information available on the web. As a result, user-friendly, automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk materials, topic identification, and structured search are some of the active areas in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider web content outliers: documents whose contents vary from the rest of the documents taken from the same domain (category). In this paper, we take advantage of the HTML structure of the web and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce the processing time, the optimized algorithm uses only data captured in <Meta> and <Title> tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in <Meta> and <Title> tags gave the same results as using text embedded in <Meta>, <Title>, and <Body> tags.
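
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character trigrams, a hypothetical domain term list, and a simple dissimilarity score (the fraction of a page's n-grams missing from the domain profile). The names `title_meta_text`, `char_ngrams`, `dissimilarity`, and `rank_outliers` are illustrative, and restricting `<meta>` to the `keywords` and `description` entries is likewise an assumption.

```python
import re
from html.parser import HTMLParser


class TitleMetaExtractor(HTMLParser):
    """Collects text from <title> and from selected <meta> content attributes."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # Assumption: only keywords/description meta tags describe content.
            if (attr_map.get("name") or "").lower() in ("keywords", "description"):
                self.parts.append(attr_map.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_meta_text(html):
    """Return lowercased text found in <title> and <meta> only; <body> is ignored."""
    parser = TitleMetaExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower()


def char_ngrams(text, n=3):
    """Return the set of character n-grams taken from each word in the text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < n:
            grams.add(word)  # short words are kept whole
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams


def dissimilarity(page_grams, domain_grams):
    """Fraction of the page's n-grams that do not occur in the domain profile."""
    if not page_grams:
        return 1.0
    return 1.0 - len(page_grams & domain_grams) / len(page_grams)


def rank_outliers(pages, domain_terms, n=3):
    """Rank pages from most to least dissimilar to the domain term list."""
    domain_grams = char_ngrams(" ".join(domain_terms), n)
    scores = {
        name: dissimilarity(char_ngrams(title_meta_text(html), n), domain_grams)
        for name, html in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because matching happens on n-grams rather than whole words, a page term such as "outliers" still overlaps with a domain term such as "outlier", which is the partial string matching the abstract refers to. Pages at the top of the ranking are candidate outliers; where to cut the list is left open here.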

Index Terms

Mining web content outliers using structure oriented weighting techniques and N-grams
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
SAC '05: Proceedings of the 2005 ACM symposium on Applied computing
March 2005
1814 pages
ISBN: 1581139640
DOI: 10.1145/1066677
Conference Chair: Hisham M. Haddad, Kennesaw State University
Editor: Lorie M. Liebrock, New Mexico Institute of Mining and Technology, Socorro, NM
Program Chairs: Andrea Omicini, Alma Mater Studiorum, Universita di Bologna, Italy; Roger L. Wainwright, University of Tulsa, OK
Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dissimilarity measure
n-grams
text categorization
web contents
web mining
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%