Abstract
We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high-dimensional Euclidean space (i.e., in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by repeatedly splitting clusters into smaller clusters. The documents are assembled into a matrix that is very sparse, and it is this sparsity that makes the algorithm very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
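As a rough illustration of the divisive idea described above, the following sketch repeatedly splits one cluster in two along its leading principal direction. The abstract does not spell out the splitting rule or the order in which clusters are chosen, so several details here are assumptions made for illustration: the function name `divisive_partition`, the choice to split the cluster with the largest scatter next, and the rule of splitting at the cluster mean. It also uses a dense SVD for brevity, whereas an efficient version would exploit the sparsity of the document matrix.

```python
import numpy as np

def scatter(X):
    """Total squared distance of the columns of X from their mean."""
    centered = X - X.mean(axis=1, keepdims=True)
    return float(np.sum(centered ** 2))

def divisive_partition(X, n_clusters):
    """Split the columns of X (n_features x n_docs) into n_clusters groups.

    Illustrative sketch only: each step splits one cluster by the sign of
    the projection of its centered documents onto the leading principal
    direction.
    """
    n_docs = X.shape[1]
    clusters = [np.arange(n_docs)]           # start with one cluster holding everything
    while len(clusters) < n_clusters:
        candidates = [i for i, idx in enumerate(clusters) if len(idx) > 1]
        if not candidates:
            break                            # nothing left that can be split
        # Assumption: split the cluster with the largest scatter next.
        target = max(candidates, key=lambda i: scatter(X[:, clusters[i]]))
        idx = clusters.pop(target)
        sub = X[:, idx]
        centered = sub - sub.mean(axis=1, keepdims=True)
        # Leading left singular vector = principal direction of this cluster.
        u, _, _ = np.linalg.svd(centered, full_matrices=False)
        proj = u[:, 0] @ centered
        left, right = idx[proj <= 0], idx[proj > 0]
        if len(left) == 0 or len(right) == 0:
            clusters.append(idx)             # degenerate: all documents identical
            break
        clusters.extend([left, right])
    labels = np.empty(n_docs, dtype=int)
    for k, idx in enumerate(clusters):
        labels[idx] = k
    return labels

# Toy usage: six "documents" with two features each, forming two obvious groups.
X = np.array([[0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
              [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]])
labels = divisive_partition(X, 2)
```

Because the result is a sequence of binary splits, the clusters form a binary tree. Asking for more clusters only refines the existing partition and does not recompute it from scratch.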
Cite this article
Boley, D. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery 2, 325–344 (1998). https://doi.org/10.1023/A:1009740529316
DOI: https://doi.org/10.1023/A:1009740529316