Principal Direction Divisive Partitioning

Data Mining and Knowledge Discovery

Abstract

We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high-dimensional Euclidean space (i.e., in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by repeatedly splitting clusters into smaller clusters. The documents are assembled into a very sparse matrix, and it is this sparsity that makes the algorithm very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
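
The divisive scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not state how a cluster is split or which cluster is split next, so the code below rests on assumptions suggested by the title: each cluster is mean-centered, its principal direction is taken as the leading left singular vector, documents are separated by the sign of their projection onto that direction, and the cluster with the largest scatter is split next. The function names (`split_cluster`, `scatter`, `pddp`) are hypothetical, and a dense SVD is used for brevity, so the sketch does not exploit the sparsity the abstract emphasizes.

```python
import numpy as np

def split_cluster(M, idx):
    """Split the documents (columns) idx of M into two groups.

    Assumption: the split uses the sign of each document's projection
    onto the principal direction of the mean-centered cluster.
    """
    A = M[:, idx]
    centered = A - A.mean(axis=1, keepdims=True)
    # Leading left singular vector = principal direction of the cluster.
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    proj = U[:, 0] @ centered
    return idx[proj <= 0], idx[proj > 0]

def scatter(M, idx):
    """Total squared distance of the documents idx from their centroid."""
    A = M[:, idx]
    return float(((A - A.mean(axis=1, keepdims=True)) ** 2).sum())

def pddp(M, n_clusters):
    """Repeatedly split clusters until n_clusters are obtained.

    M holds one document per column. Assumption: the next cluster to
    split is the one with the largest scatter.
    """
    clusters = [np.arange(M.shape[1])]
    while len(clusters) < n_clusters:
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break  # only singleton clusters remain
        i = max(candidates, key=lambda k: scatter(M, clusters[k]))
        left, right = split_cluster(M, clusters[i])
        if len(left) == 0 or len(right) == 0:
            break  # degenerate cluster (e.g. identical documents)
        clusters.pop(i)
        clusters.extend([left, right])
    return clusters
```

Calling `pddp(M, k)` on a term-by-document array `M` returns a list of index arrays, one per cluster. A version faithful to the efficiency claim would keep `M` sparse and compute the principal direction without explicitly forming the centered matrix.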

Cite this article

Boley, D. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery 2, 325–344 (1998). https://doi.org/10.1023/A:1009740529316

  • DOI: https://doi.org/10.1023/A:1009740529316