A dynamic attribute-based data filtering and recovery scheme for web information processing

  • Regular Paper
Knowledge and Information Systems

Abstract

Transmitting excessive amounts of Web data over a network channel on the Internet causes data processing problems, including the need to select which useful information to retain for various data applications. In this paper, we present an approach for filtering less-informative attribute data from a source Website. Filtering attributes, instead of tuples (records), is imperative, since filtering a complete tuple would discard informative attribute data in the tuple along with the less-informative data. Since data filtered at the source Website may be of interest to the user at the destination Website, we design a data recovery approach that maintains the minimal amount of information needed for recovery while imposing minimal overhead at the source Website. Our data filtering and recovery approach (1) handles a wide range of Web data in different application domains (such as weather, stock exchanges, and Internet traffic), (2) is dynamic in nature, since each filtering scheme adjusts the amount of data to be filtered as needed, and (3) is adaptive, which is appealing in an ever-changing Internet environment.
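To make the contrast between attribute-level and tuple-level filtering concrete, the following is a minimal illustrative sketch and not the scheme developed in the paper. It assumes that informativeness is measured by the Shannon entropy of an attribute's values and that recovery information is simply the most frequent value of each filtered attribute; both choices, along with the names `entropy`, `filter_attributes`, `recover`, and the `threshold` parameter, are hypothetical and are not taken from the abstract.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_attributes(tuples, threshold):
    """Drop attributes whose entropy is below `threshold`.

    The entropy criterion is an assumption made for illustration only.
    Returns the filtered tuples plus a small recovery record that maps
    each dropped attribute to its most frequent value.
    """
    if not tuples:
        return [], {}
    dropped = {}
    for attr in tuples[0]:
        column = [t[attr] for t in tuples]
        if entropy(column) < threshold:
            dropped[attr] = Counter(column).most_common(1)[0][0]
    kept = [{a: v for a, v in t.items() if a not in dropped} for t in tuples]
    return kept, dropped


def recover(kept, dropped):
    """Approximately restore dropped attributes from the recovery record."""
    return [{**t, **dropped} for t in kept]


# Hypothetical weather readings: "unit" never varies, so it carries no information.
readings = [
    {"station": "A", "unit": "C", "temp": 21},
    {"station": "A", "unit": "C", "temp": 23},
    {"station": "B", "unit": "C", "temp": 19},
    {"station": "B", "unit": "C", "temp": 25},
]
kept, dropped = filter_attributes(readings, threshold=0.5)
restored = recover(kept, dropped)
```

In this sketch every tuple survives filtering with its informative attributes intact, whereas dropping whole tuples would have lost `station` and `temp` values as well. Recovery is exact here only because the filtered attribute is constant; for an attribute with some variation, restoring the most frequent value would be an approximation.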

Author information

Corresponding author

Correspondence to Amit Ahuja.

About this article

Cite this article

Ahuja, A., Ng, YK. A dynamic attribute-based data filtering and recovery scheme for web information processing. Knowl Inf Syst 18, 263–291 (2009). https://doi.org/10.1007/s10115-008-0140-8

  • DOI: https://doi.org/10.1007/s10115-008-0140-8
