Improving Mining Quality by Exploiting Data Dependency

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3518)

Abstract

The usefulness of the results produced by data mining methods can be critically impaired by several factors such as (1) low quality of data, including errors due to contamination, or incompleteness due to limited bandwidth for data acquisition, and (2) inadequacy of the data model for capturing complex probabilistic relationships in data. Fortunately, a wide spectrum of applications exhibit strong dependencies between data samples. For example, the readings of nearby sensors are generally correlated, and proteins interact with each other when performing crucial functions. Therefore, dependencies among data can be successfully exploited to remedy the problems mentioned above. In this paper, we propose a unified approach to improving mining quality using Markov networks as the data model to exploit local dependencies. Belief propagation is used to efficiently compute the marginal or maximum posterior probabilities, so as to clean the data, to infer missing values, or to improve the mining results from a model that ignores these dependencies. To illustrate the benefits and great generality of the technique, we present its application to three challenging problems: (i) cost-efficient sensor probing, (ii) enhancing protein function predictions, and (iii) sequence data denoising.
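To make the core inference step concrete, below is a minimal sketch (in Python with NumPy) of loopy sum-product belief propagation on a pairwise Markov network, applied to denoising a noisy binary sequence. The potentials, noise level, toy data, and function names are illustrative assumptions and are not taken from the paper, which also uses the max-product (maximum posterior) variant and applies the framework to sensor probing and protein function prediction.

    # Minimal sketch: loopy sum-product belief propagation on a pairwise Markov
    # network, used here to denoise a binary sequence observed through a noisy
    # channel. All potentials, the noise level, and the toy data are assumptions
    # for illustration only; they are not taken from the paper.
    import numpy as np

    def belief_propagation(unary, pairwise, edges, n_iters=20):
        """Approximate marginals b_i(x_i) for a pairwise Markov network.

        unary:    (n_nodes, n_states) node potentials phi_i(x_i)
        pairwise: (n_states, n_states) edge potential psi(x_i, x_j), shared by all edges
        edges:    list of undirected edges (i, j)
        """
        n_nodes, n_states = unary.shape
        neighbors = {i: [] for i in range(n_nodes)}
        for a, b in edges:
            neighbors[a].append(b)
            neighbors[b].append(a)
        # One message per directed edge, initialized to uniform.
        msgs = {(i, j): np.full(n_states, 1.0 / n_states)
                for a, b in edges for i, j in ((a, b), (b, a))}

        for _ in range(n_iters):
            new_msgs = {}
            for i, j in msgs:
                # Combine the node potential with all incoming messages except j's.
                prod = unary[i].copy()
                for k in neighbors[i]:
                    if k != j:
                        prod *= msgs[(k, i)]
                m = pairwise.T @ prod            # sum over x_i of psi(x_i, x_j) * prod(x_i)
                new_msgs[(i, j)] = m / m.sum()   # normalize for numerical stability
            msgs = new_msgs

        beliefs = unary.copy()
        for i in range(n_nodes):
            for k in neighbors[i]:
                beliefs[i] *= msgs[(k, i)]
        return beliefs / beliefs.sum(axis=1, keepdims=True)

    # Toy usage: a length-15 binary sequence corrupted by 20% symbol flips.
    rng = np.random.default_rng(0)
    true_seq = np.repeat([0, 1, 0], 5)
    noisy = np.where(rng.random(true_seq.size) < 0.2, 1 - true_seq, true_seq)

    noise = 0.2
    unary = np.where(noisy[:, None] == np.arange(2), 1.0 - noise, noise)
    pairwise = np.array([[0.8, 0.2],
                         [0.2, 0.8]])            # neighboring symbols tend to agree
    edges = [(t, t + 1) for t in range(true_seq.size - 1)]

    beliefs = belief_propagation(unary, pairwise, edges)
    print("noisy:   ", noisy)
    print("denoised:", beliefs.argmax(axis=1))

Replacing the sum in the message update (the matrix-vector product) with an element-wise maximum gives the max-product variant, whose argmax beliefs approximate the maximum posterior assignment rather than per-node marginals.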

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chu, F., Wang, Y., Zaniolo, C., Parker, D.S. (2005). Improving Mining Quality by Exploiting Data Dependency. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science (LNAI), vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_57

  • DOI: https://doi.org/10.1007/11430919_57

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26076-9

  • Online ISBN: 978-3-540-31935-1

  • eBook Packages: Computer Science, Computer Science (R0)
