Improving Mining Quality by Exploiting Data Dependency

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3518)

Abstract

The usefulness of the results produced by data mining methods can be critically impaired by several factors such as (1) low quality of data, including errors due to contamination, or incompleteness due to limited bandwidth for data acquisition, and (2) inadequacy of the data model for capturing complex probabilistic relationships in data. Fortunately, a wide spectrum of applications exhibit strong dependencies between data samples. For example, the readings of nearby sensors are generally correlated, and proteins interact with each other when performing crucial functions. Therefore, dependencies among data can be successfully exploited to remedy the problems mentioned above. In this paper, we propose a unified approach to improving mining quality using Markov networks as the data model to exploit local dependencies. Belief propagation is used to efficiently compute the marginal or maximum posterior probabilities, so as to clean the data, to infer missing values, or to improve the mining results from a model that ignores these dependencies. To illustrate the benefits and great generality of the technique, we present its application to three challenging problems: (i) cost-efficient sensor probing, (ii) enhancing protein function predictions, and (iii) sequence data denoising.
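To make the core inference step concrete, below is a minimal sketch (in Python with NumPy) of loopy sum-product belief propagation on a pairwise Markov network, applied to denoising a noisy binary sequence. The potentials, noise level, toy data, and function names are illustrative assumptions and are not taken from the paper, which also uses the max-product (maximum posterior) variant and applies the framework to sensor probing and protein function prediction.

    # Minimal sketch: loopy sum-product belief propagation on a pairwise Markov
    # network, used here to denoise a binary sequence observed through a noisy
    # channel. All potentials, the noise level, and the toy data are assumptions
    # for illustration only; they are not taken from the paper.
    import numpy as np

    def belief_propagation(unary, pairwise, edges, n_iters=20):
        """Approximate marginals b_i(x_i) for a pairwise Markov network.

        unary:    (n_nodes, n_states) node potentials phi_i(x_i)
        pairwise: (n_states, n_states) edge potential psi(x_i, x_j), shared by all edges
        edges:    list of undirected edges (i, j)
        """
        n_nodes, n_states = unary.shape
        neighbors = {i: [] for i in range(n_nodes)}
        for a, b in edges:
            neighbors[a].append(b)
            neighbors[b].append(a)
        # One message per directed edge, initialized to uniform.
        msgs = {(i, j): np.full(n_states, 1.0 / n_states)
                for a, b in edges for i, j in ((a, b), (b, a))}

        for _ in range(n_iters):
            new_msgs = {}
            for i, j in msgs:
                # Combine the node potential with all incoming messages except j's.
                prod = unary[i].copy()
                for k in neighbors[i]:
                    if k != j:
                        prod *= msgs[(k, i)]
                m = pairwise.T @ prod            # sum over x_i of psi(x_i, x_j) * prod(x_i)
                new_msgs[(i, j)] = m / m.sum()   # normalize for numerical stability
            msgs = new_msgs

        beliefs = unary.copy()
        for i in range(n_nodes):
            for k in neighbors[i]:
                beliefs[i] *= msgs[(k, i)]
        return beliefs / beliefs.sum(axis=1, keepdims=True)

    # Toy usage: a length-15 binary sequence corrupted by 20% symbol flips.
    rng = np.random.default_rng(0)
    true_seq = np.repeat([0, 1, 0], 5)
    noisy = np.where(rng.random(true_seq.size) < 0.2, 1 - true_seq, true_seq)

    noise = 0.2
    unary = np.where(noisy[:, None] == np.arange(2), 1.0 - noise, noise)
    pairwise = np.array([[0.8, 0.2],
                         [0.2, 0.8]])            # neighboring symbols tend to agree
    edges = [(t, t + 1) for t in range(true_seq.size - 1)]

    beliefs = belief_propagation(unary, pairwise, edges)
    print("noisy:   ", noisy)
    print("denoised:", beliefs.argmax(axis=1))

Replacing the sum in the message update (the matrix-vector product) with an element-wise maximum gives the max-product variant, whose argmax beliefs approximate the maximum posterior assignment rather than per-node marginals.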

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chu, F., Wang, Y., Zaniolo, C., Parker, D.S. (2005). Improving Mining Quality by Exploiting Data Dependency. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science (LNAI), vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_57

  • DOI: https://doi.org/10.1007/11430919_57

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26076-9

  • Online ISBN: 978-3-540-31935-1

  • eBook Packages: Computer Science, Computer Science (R0)
