research-article

Redefining class definitions using constraint-based clustering: an application to remote sensing of the earth's surface

Authors:
Dan R. Preston, Tufts University, Medford, MA, USA
Carla E. Brodley, Tufts University, Medford, MA, USA
Roni Khardon, Tufts University, Medford, MA, USA
Damien Sulla-Menashe, Boston University, Boston, MA, USA
Mark Friedl, Boston University, Boston, MA, USA

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, July 2010, Pages 823–832
https://doi.org/10.1145/1835804.1835908

Published: 25 July 2010

ABSTRACT

Two aspects are crucial when constructing any real-world supervised classification task: the set of classes whose distinction might be useful to the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined from some initial intuition, but these labels are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth, but it has been suspected that these labels are not ideal: some classes may be best split into subclasses, whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an L × L matrix of pairwise probabilistic constraints expressing their beliefs as to whether the L classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC), extends PPC, showing how its time complexity can be reduced from O(N²) to O(NL) for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints and by providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels.
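The reduction from O(N²) to O(NL) can be illustrated with a small sketch. It is not the CPPC algorithm from the paper; it only shows why a pairwise term collapses when every instance-level constraint is determined by the two instances' original class labels. The names below are assumptions made for illustration: `labels` holds each instance's original class (0 to L-1), `W` is a hypothetical L × L matrix of signed weights (positive favours merging, negative favours separation), and `q` holds each instance's soft membership over K candidate clusters.

```python
def pairwise_support_naive(labels, W, q):
    """O(N^2 K): for each instance, visit every other instance."""
    n, k = len(labels), len(q[0])
    out = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # an instance places no constraint on itself
            w = W[labels[i]][labels[j]]  # weight depends only on the two classes
            for c in range(k):
                out[i][c] += w * q[j][c]
    return out


def pairwise_support_grouped(labels, W, q):
    """O(N L K): sum memberships per original class once, then reuse them."""
    n, k, num_classes = len(labels), len(q[0]), len(W)
    # class_sums[l][c] = total membership in cluster c over instances of class l
    class_sums = [[0.0] * k for _ in range(num_classes)]
    for j in range(n):
        for c in range(k):
            class_sums[labels[j]][c] += q[j][c]
    out = [[0.0] * k for _ in range(n)]
    for i in range(n):
        li = labels[i]
        for c in range(k):
            total = sum(W[li][l] * class_sums[l][c] for l in range(num_classes))
            # subtract the self-pair that the class sum included
            out[i][c] = total - W[li][li] * q[i][c]
    return out
```

Both functions return the same values. The grouped version touches each instance once to build the per-class sums and then does L multiplications per instance and cluster, so its cost grows linearly in N for fixed L and K.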

Supplemental Material

kdd2010_preston_rcdu_01.mov (mov, 110.2 MB)

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
July 2010
1240 pages
ISBN: 9781450300551
DOI: 10.1145/1835804
General Chairs: Bharat Rao (Siemens), Balaji Krishnapuram (Siemens)
Program Chairs: Andrew Tomkins (Google Inc.), Qiang Yang (Hong Kong University of Science and Technology)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
class discovery
constraint-based clustering
kdd-process
mining scientific data
remote sensing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%