A dynamic attribute-based data filtering and recovery scheme for web information processing

Ahuja, Amit; Ng, Yiu-Kai

doi:10.1007/s10115-008-0140-8

Regular Paper
Published: 08 May 2008

Volume 18, pages 263–291, (2009)
Knowledge and Information Systems

Amit Ahuja¹ & Yiu-Kai Ng¹

Abstract

Web data transmitted over the Internet often arrives in excessive volume, which creates data processing problems, including the need to choose selectively which useful information to retain for various data applications. In this paper, we present an approach for filtering less-informative attribute data from a source Website. Filtering attributes, rather than tuples (records), is imperative because filtering a complete tuple discards informative attribute data along with the less-informative attribute data in that tuple. Since data filtered at the source Website may still be of interest to the user at the destination Website, we design a data recovery approach that maintains the minimal amount of information needed for recovery while imposing minimal overhead on the source Website. Our data filtering and recovery approach (1) handles a wide range of Web data in different application domains (such as weather, stock exchanges, and Internet traffic), (2) is dynamic, since each filtering scheme adjusts the amount of data to be filtered as needed, and (3) is adaptive, which is appealing in an ever-changing Internet environment.
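
The contrast between attribute-level and tuple-level filtering can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes Shannon entropy as a stand-in measure of how informative an attribute is, a hypothetical fixed `threshold`, and a naive recovery store that simply holds the dropped values at the source. All function and field names are invented for illustration.

```python
import math
from collections import Counter


def attribute_entropy(records, attr):
    """Shannon entropy (in bits) of the values observed for one attribute."""
    counts = Counter(r[attr] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_attributes(records, threshold):
    """Drop attributes whose entropy is below `threshold` (assumed rule).

    Returns the records to send plus a per-record store of dropped values
    that stays at the source so the data can be recovered on request.
    """
    attrs = list(records[0].keys())
    dropped = [a for a in attrs if attribute_entropy(records, a) < threshold]
    sent = [{a: v for a, v in r.items() if a not in dropped} for r in records]
    held_back = [{a: r[a] for a in dropped} for r in records]
    return sent, held_back


def recover(sent, held_back):
    """Rebuild the original records from the sent and held-back parts."""
    return [{**s, **h} for s, h in zip(sent, held_back)]


# Hypothetical weather readings: "unit" never varies, so it carries no
# information, while every record still has informative attributes.
readings = [
    {"station": "A", "temp_c": 21.5, "unit": "metric"},
    {"station": "B", "temp_c": 19.0, "unit": "metric"},
    {"station": "C", "temp_c": 23.1, "unit": "metric"},
]
sent, held_back = filter_attributes(readings, threshold=0.5)
```

Here only the constant `unit` attribute is withheld, and every record keeps its `station` and `temp_c` values, whereas dropping whole tuples would have discarded those as well. A real scheme would presumably store far less than a full copy of the dropped values and would adjust the threshold dynamically, as the abstract describes.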

Author information

Authors and Affiliations

Computer Science Department, Brigham Young University, Provo, UT, 84602, USA
Amit Ahuja & Yiu-Kai Ng

Corresponding author

Correspondence to Amit Ahuja.

About this article

Cite this article

Ahuja, A., Ng, YK. A dynamic attribute-based data filtering and recovery scheme for web information processing. Knowl Inf Syst 18, 263–291 (2009). https://doi.org/10.1007/s10115-008-0140-8

Received: 30 March 2007
Revised: 07 December 2007
Accepted: 15 March 2008
Published: 08 May 2008
Issue Date: March 2009
DOI: https://doi.org/10.1007/s10115-008-0140-8
