DOI: 10.1145/1871437.1871547
research-article

Pattern discovery for large mixed-mode database

Published: 26 October 2010

ABSTRACT

In business and industry today, large databases with mixed data types (continuous and categorical) are very common. There is a great need to discover patterns in them for knowledge interpretation and understanding. In the past, for classification, this problem has been solved as a discrete data problem by first discretizing the continuous data based on the class-attribute interdependence relationship. However, no proper solution exists so far when class information is unavailable. Hence, important pattern post-processing tasks such as pattern clustering and summarization cannot be applied to mixed-mode data. This paper presents a new method for solving the problem. It is based on two essential concepts. (1) Although class information is absent, for a correlated dataset the attribute with the strongest interdependence with the others in the group can be used to drive the discretization of the continuous data. (2) For a large database, correlated attribute groups must first be obtained by attribute clustering before (1) can be applied. Based on (1) and (2), pattern discovery methods are developed for mixed-mode data. Extensive experiments using synthetic and real-world data were conducted to validate the usefulness and effectiveness of the proposed method.
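As a rough illustration of concept (1), the following Python sketch picks a driving attribute within one group of correlated attributes and then uses it in place of a class label to place a cut point on a continuous attribute. It is not the paper's algorithm. The interdependence measure (mutual information divided by joint entropy), the single-cut search, the function names `redundancy`, `pick_driver` and `best_cut`, and the toy data are all assumptions made for illustration. The attribute clustering step of concept (2) is not shown: the group is taken as given.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def redundancy(x, y):
    """Assumed interdependence measure: I(X;Y) / H(X,Y), 0 if H(X,Y) is 0."""
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0:
        return 0.0
    return (entropy(x) + entropy(y) - h_xy) / h_xy

def pick_driver(columns):
    """Name of the attribute with the largest summed interdependence with the rest."""
    def total(name):
        return sum(redundancy(columns[name], columns[other])
                   for other in columns if other != name)
    return max(columns, key=total)

def best_cut(continuous, driver_values):
    """Single cut point maximizing mutual information with the driver attribute."""
    distinct = sorted(set(continuous))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    def score(cut):
        binned = [v > cut for v in continuous]
        joint = list(zip(binned, driver_values))
        return entropy(binned) + entropy(driver_values) - entropy(joint)
    return max(candidates, key=score)

# Hypothetical toy group of correlated categorical attributes (8 records).
columns = {
    "grade":   ["lo", "lo", "lo", "lo", "mid", "mid", "hi", "hi"],
    "pass":    ["n", "n", "n", "n", "y", "y", "y", "y"],
    "honours": ["n", "n", "n", "n", "n", "n", "y", "y"],
}
# Hypothetical continuous attribute from the same records.
score_values = [40.0, 42.0, 45.0, 48.0, 61.0, 65.0, 88.0, 91.0]

driver = pick_driver(columns)
cut = best_cut(score_values, columns[driver])
print(driver, cut)
```

A full discretizer would search for several interval boundaries rather than one, and would run once per attribute group after clustering; the single cut here only shows how a driving attribute can stand in for the missing class label.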


Published in

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
