research-article

Pattern discovery for large mixed-mode database

Authors:
Andrew K.C. Wong, University of Waterloo, Waterloo, ON, Canada
Bin Wu, University of Waterloo, Waterloo, ON, Canada
Gene P.K. Wu, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Keith C.C. Chan, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 859–868, https://doi.org/10.1145/1871437.1871547

Published: 26 October 2010

ABSTRACT

In business and industry today, large databases with mixed data types (continuous and categorical) are very common, and there is a great need to discover patterns in them for knowledge interpretation and understanding. For classification, this problem has been solved as a discrete data problem by first discretizing the continuous data based on the class-attribute interdependence relationship. However, no proper solution exists so far when class information is unavailable. Hence, important pattern post-processing tasks such as pattern clustering and summarization cannot be applied to mixed-mode data. This paper presents a new method for solving the problem. It is based on two essential concepts. (1) Although class information is absent, for a correlated dataset the attribute with the strongest interdependence with the others in the group can be used to drive the discretization of the continuous data. (2) For a large database, correlated attribute groups must first be obtained by attribute clustering before (1) can be applied. Based on (1) and (2), pattern discovery methods are developed for mixed-mode data. Extensive experiments using synthetic and real-world data were conducted to validate the usefulness and effectiveness of the proposed method.
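
Concept (1) can be illustrated with a minimal sketch. This is not the authors' algorithm; it only shows the general idea under stated assumptions: interdependence is measured by plain mutual information (the abstract does not give the exact measure), the candidate driving attributes are already discrete, and the continuous attribute is split at a single cut point. The function names `mutual_information`, `pick_driver`, and `best_cut` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def pick_driver(columns):
    """Name of the attribute whose summed MI with all other attributes is largest.

    Assumption: `columns` maps attribute names to discrete value lists that
    belong to one correlated attribute group.
    """
    def total(name):
        return sum(
            mutual_information(columns[name], columns[other])
            for other in columns
            if other != name
        )
    return max(columns, key=total)


def best_cut(values, driver):
    """Single cut point on `values` that maximizes MI with the driving attribute.

    Assumption: a binary split is enough for illustration; candidate cuts are
    the midpoints between consecutive distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(
        candidates,
        key=lambda cut: mutual_information([v > cut for v in values], driver),
    )
```

For example, `pick_driver` could be called on the discrete attributes of one group, and the returned attribute's values passed as `driver` to `best_cut` for each continuous attribute in that group. The driving attribute plays the role that the class label plays in class-dependent discretization.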

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        1. Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
General Chair: Jimmy Huang (York University, Canada)
Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
attribute clustering
data mining
mixed mode data
mutual information
pattern discovery
unsupervised discretization
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%