Clustering of Distributions: A Case of Patent Citations

Journal of Classification

Abstract

Data units are often described by discrete distributions (for example, a work described by the distribution of its citations over time, or a population pyramid described by its age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal the typical patterns hidden in the data.

In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method, both based on a relative error measure between a unit and the corresponding cluster representative (the leader). The proposed approach is illustrated on citation distributions derived from a data set of US patents from 1980 to 1999. These new methods were developed because clustering units described by distributions with the classical k-means method reveals patterns with single high peaks, each corresponding to a single year. These patterns prevail over the other distribution shapes also present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance increases slowly throughout the time period.
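The abstract does not state the exact error measure or how a leader is recomputed, so the following is only a minimal sketch of a leaders-style iteration under stated assumptions, not the authors' implementation. It assumes the dissimilarity d(x, t) = sum_i (x_i - t_i)^2 / t_i between a unit x and a leader t; for that choice the leader minimising the total error of a cluster is the component-wise root mean square of its units. The function names, the `EPS` guard and the toy data are all hypothetical.

```python
import numpy as np

EPS = 1e-12  # assumption of this sketch: guards against division by a zero leader component


def relative_error(x, t):
    """Assumed dissimilarity: sum_i (x_i - t_i)^2 / t_i (broadcasts over leading axes)."""
    return np.sum((x - t) ** 2 / np.maximum(t, EPS), axis=-1)


def update_leader(units):
    """Minimiser of the assumed error over a cluster: root mean square per component."""
    return np.sqrt(np.mean(units ** 2, axis=0))


def leaders_clustering(X, init_leaders, n_iter=100):
    """Alternate between assigning units to the nearest leader and updating leaders."""
    leaders = np.asarray(init_leaders, dtype=float).copy()
    labels = None
    for _ in range(n_iter):
        # Pairwise errors between every unit and every leader, shape (n_units, n_leaders).
        d = relative_error(X[:, None, :], leaders[None, :, :])
        new_labels = np.argmin(d, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the leaders would not change either
        labels = new_labels
        for j in range(len(leaders)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous leader
                leaders[j] = update_leader(members)
    return labels, leaders


# Toy data: two units with an early peak and two with a late peak.
X = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.1, 0.1, 0.3, 0.5],
])
labels, leaders = leaders_clustering(X, init_leaders=X[[0, 2]])
print(labels.tolist())  # [0, 0, 1, 1]
```

Because the error is divided by the leader's component, a deviation in a bin where the leader is small costs more than the same deviation in a bin where it is large, which is one way a relative measure can behave differently from the squared Euclidean distance used by k-means.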

Author information

Corresponding author

Correspondence to Nataša Kejžar.

Additional information

The authors would like to thank the anonymous referees for their many valuable comments and suggestions on how to improve this paper. This work was partially supported by the Slovenian Research Agency, Project J1-6062-0101.

About this article

Cite this article

Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of Distributions: A Case of Patent Citations. J Classif 28, 156–183 (2011). https://doi.org/10.1007/s00357-011-9084-x

  • DOI: https://doi.org/10.1007/s00357-011-9084-x
