Abstract
Transmitting excessive amounts of Web data over a network channel on the Internet causes data processing problems, including the problem of selecting which useful information to retain for various data applications. In this paper, we present an approach for filtering less-informative attribute data from a source Website. Filtering attributes, rather than tuples (records), is imperative, since filtering a complete tuple would discard informative, as well as less-informative, attribute data in the tuple. Since data filtered at the source Website may be of interest to the user at the destination Website, we design a data recovery approach that maintains the minimal amount of information needed for data recovery while imposing minimal overhead at the source Website. Our data filtering and recovery approach (1) handles a wide range of Web data in different application domains (such as weather, stock exchanges, and Internet traffic), (2) is dynamic, since each filtering scheme adjusts the amount of data to be filtered as needed, and (3) is adaptive, which is appealing in an ever-changing Internet environment.
Cite this article
Ahuja, A., Ng, YK. A dynamic attribute-based data filtering and recovery scheme for web information processing. Knowl Inf Syst 18, 263–291 (2009). https://doi.org/10.1007/s10115-008-0140-8
DOI: https://doi.org/10.1007/s10115-008-0140-8