ABSTRACT
In some data sets, the number of categories (i.e., classes) represented is not known in advance. Discovering these categories can be difficult, particularly when a data set is skewed so that some classes contain far more data points than others. Rare category detection algorithms address this problem by trying to present a user with at least one data point from each category while minimizing the total number of data points presented. We present an algorithm based on active and semi-supervised learning that finds category clusters using a query selection strategy that maximizes the distance from the set of already labeled data points to the query data point. We evaluate the algorithm as a rare category detector on artificially skewed versions of the MNIST data set, investigating how performance varies with the relative frequency of classes and with inherent differences in class structure in feature space.
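The query selection strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the distance from the labeled set to a candidate point is the minimum Euclidean distance to any labeled point, and it omits the semi-supervised clustering step entirely. The names `select_far_point`, `discover_categories`, and `oracle` are hypothetical, and the toy data stands in for a skewed data set.

```python
import math
from typing import Callable, Collection, Dict, Hashable, Sequence

Point = Sequence[float]


def select_far_point(points: Sequence[Point], labeled: Collection[int]) -> int:
    """Return the index of the unlabeled point farthest from the labeled set.

    Assumption: distance to the set is the minimum Euclidean distance
    to any labeled point. `labeled` must be non-empty.
    """
    best_idx, best_dist = -1, -1.0
    for i, p in enumerate(points):
        if i in labeled:
            continue
        d = min(math.dist(p, points[j]) for j in labeled)
        if d > best_dist:
            best_idx, best_dist = i, d
    return best_idx


def discover_categories(
    points: Sequence[Point],
    oracle: Callable[[int], Hashable],
    n_queries: int,
    first: int = 0,
) -> Dict[int, Hashable]:
    """Query the oracle (a stand-in for the user) n_queries times.

    The first query is fixed at index `first`; each later query is the
    current far point. Returns a mapping from queried index to its label.
    """
    labeled = {first: oracle(first)}
    for _ in range(n_queries - 1):
        if len(labeled) == len(points):
            break  # nothing left to query
        q = select_far_point(points, labeled)
        labeled[q] = oracle(q)
    return labeled


# Toy skewed data: one common class near the origin and two rare classes.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),  # common
    (5.0, 5.0), (5.1, 5.0),                                        # rare_a
    (-4.0, 3.0),                                                   # rare_b
]
true_labels = ["common"] * 5 + ["rare_a"] * 2 + ["rare_b"]

found = discover_categories(points, lambda i: true_labels[i], n_queries=3)
print(sorted(set(found.values())))
```

On this toy data, three queries are enough to see one point from each class, because each far-point query lands in a region that is not yet represented among the labeled points. Whether that holds on real data depends on how well classes separate in feature space, which is the question the evaluation on skewed MNIST addresses.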
Index Terms
- Far Point Algorithm: Active Semi-supervised Clustering for Rare Category Detection