Abstract
The usefulness of the results produced by data mining methods can be critically impaired by several factors, such as (1) low data quality, including errors due to contamination or incompleteness due to limited bandwidth for data acquisition, and (2) inadequacy of the data model for capturing complex probabilistic relationships in the data. Fortunately, a wide spectrum of applications exhibits strong dependencies between data samples. For example, the readings of nearby sensors are generally correlated, and proteins interact with each other when performing crucial functions. These dependencies can therefore be exploited to remedy the problems mentioned above. In this paper, we propose a unified approach to improving mining quality that uses Markov networks as the data model to exploit local dependencies. Belief propagation is used to efficiently compute the marginal or maximum posterior probabilities, so as to clean the data, infer missing values, or improve the mining results from a model that ignores these dependencies. To illustrate the benefits and generality of the technique, we present its application to three challenging problems: (i) cost-efficient sensor probing, (ii) enhancing protein function predictions, and (iii) sequence data denoising.
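To make the idea concrete, the following is a minimal sketch of sum-product belief propagation on a chain-structured pairwise Markov network, applied to a toy binary sequence denoising task. It is not the authors' implementation: the function names, the 0.8/0.2 observation model, and the 0.9/0.1 neighbor compatibility table are all assumptions chosen for illustration. On a chain, belief propagation is exact, which the brute-force helper can be used to confirm.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain-shaped pairwise Markov network.

    unary[i][s]    : local evidence for node i taking state s (hypothetical values)
    pairwise[s][t] : compatibility of neighboring states (s, t), shared by all edges
    Returns one normalized marginal distribution per node.
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message into node i from its left neighbor
    bwd = [[1.0] * k for _ in range(n)]  # message into node i from its right neighbor

    # Left-to-right pass.
    for i in range(1, n):
        msg = [sum(fwd[i - 1][s] * unary[i - 1][s] * pairwise[s][t] for s in range(k))
               for t in range(k)]
        z = sum(msg)
        fwd[i] = [m / z for m in msg]

    # Right-to-left pass.
    for i in range(n - 2, -1, -1):
        msg = [sum(bwd[i + 1][t] * unary[i + 1][t] * pairwise[s][t] for t in range(k))
               for s in range(k)]
        z = sum(msg)
        bwd[i] = [m / z for m in msg]

    # Belief = local evidence times both incoming messages, normalized.
    marginals = []
    for i in range(n):
        belief = [unary[i][s] * fwd[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Exact marginals by enumerating every joint assignment (small chains only)."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for x in product(range(k), repeat=n):
        weight = 1.0
        for i in range(n):
            weight *= unary[i][x[i]]
        for i in range(n - 1):
            weight *= pairwise[x[i]][x[i + 1]]
        for i in range(n):
            totals[i][x[i]] += weight
    z = sum(totals[0])
    return [[v / z for v in row] for row in totals]


# Toy data (assumed): a binary sequence with one suspicious reading in the middle.
observed = [0, 0, 1, 0, 0]
# Assumed noise model: an observation matches the true state with probability 0.8.
unary = [[0.8, 0.2] if o == 0 else [0.2, 0.8] for o in observed]
# Assumed dependency: neighboring positions agree with weight 0.9, disagree with 0.1.
pairwise = [[0.9, 0.1], [0.1, 0.9]]

marginals = chain_marginals(unary, pairwise)
denoised = [max(range(2), key=lambda s: m[s]) for m in marginals]
print(denoised)
```

With these assumed parameters, the isolated `1` in the middle is outweighed by the evidence passed in from its neighbors, so the most probable state at that position becomes `0`. Graphs with cycles would need the loopy variant of belief propagation, which is approximate and is not shown here.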
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chu, F., Wang, Y., Zaniolo, C., Parker, D.S. (2005). Improving Mining Quality by Exploiting Data Dependency. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_57
DOI: https://doi.org/10.1007/11430919_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science, Computer Science (R0)