Discovering Classification from Data of Multiple Sources

  • Original Paper
  • Published in Data Mining and Knowledge Discovery

Abstract

In many large e-commerce organizations, multiple data sources often describe the same customers, so it is important to consolidate data from multiple sources for intelligent business decision making. In this paper, we propose a novel method that predicts the classification of data from multiple sources without class labels in each source. We test our method on artificial and real-world datasets, and show that it can classify the data accurately. From the machine learning perspective, our method removes the fundamental assumption of providing class labels in supervised learning, and bridges the gap between supervised and unsupervised learning.

Notes

  1. The integer in parentheses (for example 384) means there are 384 instances in this leaf or cluster. All the trees in the paper are represented in the same format as the output of C4.5 (Quinlan, 1993).

  2. Normally the partition trees are different from (and larger than) the ideal ones, as shown in later subsections on incomplete and noisy datasets.

  3. This was suggested by Doug Fisher.

  4. These two datasets have a large number of discrete attributes. Recall that CMS currently works only on discrete attributes.

References

  • Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.

  • Cheeseman, P. and Stutz, J. 1996. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI Press/MIT Press.

  • Church, K.W. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, B.C. Association for Computational Linguistics, pp. 76–83.

  • de Sa, V. 1994a. Learning classification with unlabeled data. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), vol. 6, pp. 112–119.

  • de Sa, V. 1994b. Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer, P. Smolensky, D. Touretzky, and A. Weigend (Eds.), pp. 300–307.

  • de Sa, V. and Ballard, D. 1998. Category learning through multi-modality sensing. Neural Computation, 10(5).

  • Fisher, D. 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172.

  • Kohavi, R. and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324.

  • Lu, S. and Chen, K. 1987. A machine learning approach to the automatic synthesis of mechanistic knowledge for engineering decision-making. Artificial Intelligence for Engineering Design, Analysis, and Manufacturing, 1:109–118.

  • Murphy, P.M. and Aha, D.W. 1992. UCI Repository of Machine Learning Databases [Machine-readable data repository]. Irvine, CA, University of California, Department of Information and Computer Science.

  • Nigam, K. and Ghani, R. 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86–93.

  • Quinlan, J. 1993. C4.5: Programs for Machine Learning. San Mateo, CA, Morgan Kaufmann.

  • Raskutti, B., Ferra, H., and Kowalczyk, A. 2002. Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625.

  • Reich, Y. 1992. Ecobweb: Preliminary user's manual. Tech. rep., Department of Civil Engineering, Carnegie Mellon University.

  • Reich, Y. and Fenves, S. 1991. The formation and use of abstract concepts in design. In Concept Formation: Knowledge and Experience in Unsupervised Learning, D. Fisher, M. Pazzani, and P. Langley (Eds.), Morgan Kaufmann, CA.

  • Reich, Y. and Fenves, S. 1992. Inductive learning of synthesis knowledge. International Journal of Expert Systems: Research and Applications, 5(4):275–297.

  • Sinkkonen, J., Nikkilä, J., Lahti, L., and Kaski, S. 2004. Associative clustering. In Proceedings of 15th European Conference on Machine Learning (ECML 2004), pp. 396–406.

  • Turney, P. 1993. Exploiting context when learning to classify. In Proceedings of ECML-93, pp. 402–407.

  • Wu, X. and Zhang, S. 2003. Synthesizing high-frequency rules from different data sources. IEEE Transactions on Knowledge and Data Engineering, 15(2):353–367.

  • Yao, Y., Chen, L., Goh, A., and Wong, A. 2002. Clustering gene data via associative clustering neural network. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP 2002), pp. 2228–2232.

  • Zhang, S., Wu, X., and Zhang, C. 2003. Multi-database mining. IEEE Computational Intelligence Bulletin, 2(1):5–13.

Acknowledgments

We thank Doug Fisher and Joel Martin for their extensive and insightful comments and suggestions on earlier versions of the paper. We also thank Chenghui Li for discussions and for working with CMS. Qiang Yang acknowledges the support of Hong Kong RGC grant HKUST 6187/04E.

Author information

Corresponding author

Correspondence to Charles X. Ling.

About this article

Cite this article

Ling, C.X., Yang, Q. Discovering Classification from Data of Multiple Sources. Data Min Knowl Disc 12, 181–201 (2006). https://doi.org/10.1007/s10618-005-0013-7

  • DOI: https://doi.org/10.1007/s10618-005-0013-7
