DOI: 10.1145/1014052.1014083
Turning CARTwheels: an alternating algorithm for mining redescriptions

Published: 22 August 2004

Abstract

We present an unusual algorithm involving classification trees, CARTwheels, in which two trees are grown in opposite directions so that they are joined at their leaves. This approach finds application in a new data mining task we formulate, called redescription mining. A redescription is a shift of vocabulary, that is, a different way of communicating information about a given subset of data; the goal of redescription mining is to find subsets of data that afford multiple descriptions. We highlight the importance of this problem in domains such as bioinformatics, which exhibit an underlying richness and diversity of data descriptors (e.g., genes can be studied in a variety of ways). CARTwheels exploits the duality between class partitions and path partitions in an induced classification tree to model and mine redescriptions. It helps integrate multiple forms of characterizing datasets, situates the knowledge gained from one dataset in the context of others, and harnesses high-level abstractions for uncovering cryptic and subtle features of data. We present algorithm design decisions, implementation details, and experimental results.
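
The idea of a redescription, and of alternating between two vocabularies, can be illustrated with a toy sketch. This is not the CARTwheels algorithm itself: it replaces classification trees with single Boolean descriptors, and it scores agreement with the Jaccard coefficient, which is an assumption made here for illustration and is not stated in the abstract. The descriptor names (`a1`, `b2`, ...) and the data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy data: six objects (0..5) described in two vocabularies.
# Each descriptor maps to the set of objects for which it holds.
VOCAB_A: Dict[str, FrozenSet[int]] = {
    "a1": frozenset({0, 1, 2}),
    "a2": frozenset({2, 3}),
    "a3": frozenset({4, 5}),
}
VOCAB_B: Dict[str, FrozenSet[int]] = {
    "b1": frozenset({0, 1, 2, 3}),
    "b2": frozenset({0, 1, 2}),
    "b3": frozenset({3, 4, 5}),
}


def jaccard(x: FrozenSet[int], y: FrozenSet[int]) -> float:
    """Overlap of two object sets: |x & y| / |x | y| (assumed quality measure)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def best_match(target: FrozenSet[int],
               vocab: Dict[str, FrozenSet[int]]) -> Tuple[str, float]:
    """Return the descriptor in `vocab` whose object set is closest to `target`."""
    scored = ((name, jaccard(target, objs)) for name, objs in vocab.items())
    return max(scored, key=lambda pair: pair[1])


def alternate(start: str, rounds: int = 4) -> List[Tuple[str, str, float]]:
    """Alternate between vocabularies, starting from a descriptor in VOCAB_A.

    Each round re-describes the current subset in the other vocabulary,
    then switches sides. A pair that keeps mapping to each other with a
    high score behaves like a redescription of the same subset.
    """
    side, name = "A", start
    trail: List[Tuple[str, str, float]] = []
    for _ in range(rounds):
        current = (VOCAB_A if side == "A" else VOCAB_B)[name]
        other = VOCAB_B if side == "A" else VOCAB_A
        match, score = best_match(current, other)
        trail.append((name, match, score))
        side = "B" if side == "A" else "A"
        name = match
    return trail


if __name__ == "__main__":
    for source, target, score in alternate("a1", rounds=2):
        print(f"{source} <=> {target}  (Jaccard {score:.2f})")
```

In this toy data, `a1` and `b2` cover exactly the same objects, so the alternation settles on that pair. The method described in the abstract instead grows a tree in one vocabulary to match the partition induced by a tree in the other, which allows compound descriptions rather than single descriptors.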

Published In

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN: 1581138881
DOI: 10.1145/1014052
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. classification trees
2. data mining in biological domains
3. redescriptions

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

• (2024) Redescription mining-based business process deviance analysis. Software and Systems Modeling (SoSyM) 23:6, pages 1421-1450. DOI: 10.1007/s10270-024-01231-8. Online publication date: 1-Dec-2024.
• (2024) Fast Redescription Mining Using Locality-Sensitive Hashing. Machine Learning and Knowledge Discovery in Databases. Research Track, pages 124-142. DOI: 10.1007/978-3-031-70368-3_8. Online publication date: 8-Sep-2024.
• (2023) Interactive redescription set mining and exploration. 2023 46th MIPRO ICT and Electronics Convention (MIPRO), pages 303-308. DOI: 10.23919/MIPRO57284.2023.10159966. Online publication date: 22-May-2023.
• (2023) Redistrict: Designing a Self-Serve Interactive Boundary Optimization System. Companion Publication of the 2023 ACM Designing Interactive Systems Conference, pages 284-287. DOI: 10.1145/3563703.3595662. Online publication date: 10-Jul-2023.
• (2023) On the complexity of redescription mining. Theoretical Computer Science 944:C. DOI: 10.1016/j.tcs.2022.12.023. Online publication date: 25-Jan-2023.
• (2023) Differentially private tree-based redescription mining. Data Mining and Knowledge Discovery 37:4, pages 1548-1590. DOI: 10.1007/s10618-023-00934-8. Online publication date: 16-Apr-2023.
• (2023) Rules, Subgroups and Redescriptions as Features in Classification Tasks. Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 248-260. DOI: 10.1007/978-3-031-23618-1_17. Online publication date: 31-Jan-2023.
• (2022) Inferring COVID-19 Biological Pathways from Clinical Phenotypes Via Topological Analysis. AI for Disease Surveillance and Pandemic Intelligence, pages 147-163. DOI: 10.1007/978-3-030-93080-6_12. Online publication date: 9-Mar-2022.
• (2021) Approaches for Multi-View Redescription Mining. IEEE Access 9, pages 19356-19378. DOI: 10.1109/ACCESS.2021.3054245. Online publication date: 2021.
• (2020) Mining Heterogeneous Associations from Pediatric Cancer Data by Relational Concept Analysis. 2020 International Conference on Data Mining Workshops (ICDMW), pages 597-604. DOI: 10.1109/ICDMW51313.2020.00085. Online publication date: Nov-2020.
