Discovering Classification from Data of Multiple Sources

Ling, Charles X.; Yang, Qiang

doi:10.1007/s10618-005-0013-7

Original Paper
Published: 01 April 2006

Volume 12, pages 181–201, (2006)

Data Mining and Knowledge Discovery

Charles X. Ling¹ & Qiang Yang²

Abstract

In many large e-commerce organizations, multiple data sources are often used to describe the same customers, so it is important to consolidate data from multiple sources for intelligent business decision making. In this paper, we propose a novel method that predicts the classification of data from multiple sources without class labels in each source. We test our method on artificial and real-world datasets and show that it can classify the data accurately. From the machine learning perspective, our method removes the fundamental assumption of supervised learning that class labels are provided, and bridges the gap between supervised and unsupervised learning.

Notes

The integer in parentheses (for example 384) means there are 384 instances in this leaf or cluster. All the trees in the paper are represented in the same format as the output of C4.5 (Quinlan, 1993).
Normally the partition trees are different from (and larger than) the ideal ones, as shown in later subsections on incomplete and noisy datasets.
This is suggested by Doug Fisher.
These two datasets have a large number of discrete attributes. Recall that CMS currently works only on discrete attributes.

References

Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
Cheeseman, P. and Stutz, J. 1996. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI Press/MIT Press.
Church, K.W. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, B.C. Association for Computational Linguistics, pp. 76–83.
de Sa, V. 1994a. Learning classification with unlabeled data. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), vol. 6, pp. 112–119.
de Sa, V. 1994b. Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer, P. Smolensky, D. Touretzky, and A. Weigend (Eds.), pp. 300–307.
de Sa, V. and Ballard, D. 1998. Category learning through multi-modality sensing. Neural Computation, 10(5).
Fisher, D. 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172.
Kohavi, R. and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324.
Lu, S. and Chen, K. 1987. A machine learning approach to the automatic synthesis of mechanistic knowledge for engineering decision-making. Artificial Intelligence for Engineering Design, Analysis, and Manufacturing, 1:109–118.
Murphy, P.M. and Aha, D.W. 1992. UCI Repository of Machine Learning Databases [Machine-readable data repository]. Irvine, CA, University of California, Department of Information and Computer Science.
Nigam, K. and Ghani, R. 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86–93.
Quinlan, J. 1993. C4.5: Programs for Machine Learning. San Mateo, CA, Morgan Kaufmann.
Raskutti, B., Ferra, H., and Kowalczyk, A. 2002. Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625.
Reich, Y. 1992. Ecobweb: Preliminary user's manual. Tech. rep., Department of Civil Engineering, Carnegie Mellon University.
Reich, Y. and Fenves, S. 1991. The formation and use of abstract concepts in design. In Concept Formation: Knowledge and Experience in Unsupervised Learning, D. Fisher, M. Pazzani, and P. Langley (Eds.), Morgan Kaufmann, CA.
Reich, Y. and Fenves, S. 1992. Inductive learning of synthesis knowledge. International Journal of Expert Systems: Research and Applications, 5(4):275–297.
Sinkkonen, J., Nikkilä, J., Lahti, L., and Kaski, S. 2004. Associative clustering. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), pp. 396–406.
Turney, P. 1993. Exploiting context when learning to classify. In Proceedings of ECML-93, pp. 402–407.
Wu, X. and Zhang, S. 2003. Synthesizing high-frequency rules from different data sources. IEEE Transactions on Knowledge and Data Engineering, 15(2):353–367.
Yao, Y., Chen, L., Goh, A., and Wong, A. 2002. Clustering gene data via associative clustering neural network. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP 2002), pp. 2228–2232.
Zhang, S., Wu, X., and Zhang, C. 2003. Multi-database mining. IEEE Computational Intelligence Bulletin, 2(1):5–13.

Acknowledgments

We thank Doug Fisher and Joel Martin for their extensive and insightful comments and suggestions on earlier versions of the paper. We also thank Chenghui Li for discussions and for working with CMS. Qiang Yang acknowledges the support of Hong Kong RGC grant HKUST 6187/04E.

Author information

Authors and Affiliations

Department of Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada
Charles X. Ling
Department of Computer Science, Hong Kong UST, Kowloon, Hong Kong
Qiang Yang

Corresponding author

Correspondence to Charles X. Ling.

About this article

Cite this article

Ling, C.X., Yang, Q. Discovering Classification from Data of Multiple Sources. Data Min Knowl Disc 12, 181–201 (2006). https://doi.org/10.1007/s10618-005-0013-7

Received: 03 April 2005
Accepted: 27 July 2005
Published: 01 April 2006
Issue Date: May 2006
DOI: https://doi.org/10.1007/s10618-005-0013-7
