Clustering of Distributions: A Case of Patent Citations

Kejžar, Nataša; Korenjak-Černe, Simona; Batagelj, Vladimir

doi:10.1007/s00357-011-9084-x

Published: 18 June 2011

Volume 28, pages 156–183, (2011)
Journal of Classification

Nataša Kejžar¹,
Simona Korenjak-Černe² &
Vladimir Batagelj³

Abstract

Data units are often described by discrete distributions (for example, a work described by its citation distribution over time, or a population pyramid described as an age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal the typical patterns hidden in the data.

In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method; both are based on a relative error measure between a unit and the corresponding cluster representative (the leader). The proposed approach is illustrated on citation distributions derived from a data set of US patents from 1980 to 1999. These new methods were developed because clustering units described by distributions with the classical k-means method reveals patterns with single high peaks, each corresponding to a single year, and these patterns prevail over the other distribution shapes also present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance increases slowly throughout the time period.
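
A leaders method of the kind described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the relative error measure, so the squared relative error used here, the sum over components of ((x_i - t_i) / t_i)^2, is an assumed example. The function names `relative_error`, `update_leader` and `leaders_method`, the `EPS` guard, and the toy data are all hypothetical. For this assumed measure, the leader component that minimizes the summed error over a cluster is sum(x_i^2) / sum(x_i).

```python
EPS = 1e-9  # assumption: guards against division by zero when a leader component is 0


def relative_error(unit, leader):
    """Assumed dissimilarity: squared error of the unit relative to the leader."""
    return sum(((x - t) / max(t, EPS)) ** 2 for x, t in zip(unit, leader))


def update_leader(members):
    """Component-wise minimizer of the summed relative_error over one cluster.

    For a single component, sum(((x - t) / t) ** 2) is minimal at
    t = sum(x ** 2) / sum(x); a component that is zero in every member stays 0.
    """
    leader = []
    for column in zip(*members):
        total = sum(column)
        leader.append(sum(x * x for x in column) / total if total > 0 else 0.0)
    return leader


def leaders_method(units, initial_leaders, max_iter=100):
    """Alternate between assigning units to the nearest leader and updating leaders."""
    leaders = [list(t) for t in initial_leaders]
    k = len(leaders)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: relative_error(u, leaders[c]))
            for u in units
        ]
        if new_assignment == assignment:
            break  # no unit changed cluster, so the procedure has converged
        assignment = new_assignment
        for c in range(k):
            members = [u for u, a in zip(units, assignment) if a == c]
            if members:  # an empty cluster keeps its previous leader
                leaders[c] = update_leader(members)
    return assignment, leaders


# Toy data (hypothetical): two early-peak and two late-peak distributions.
units = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
assignment, leaders = leaders_method(units, [units[0], units[2]])
print(assignment)
print(leaders)
```

As with k-means, the outcome of such a procedure depends on the initial leaders, and different starting points can converge to different local optima, so several starts would typically be compared.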

Author information

Authors and Affiliations

University of Ljubljana, Faculty of Medicine, Institute of Biostatistics and Medical Informatics, IBMI, Vrazov trg 2, 1000, Ljubljana, Slovenia
Nataša Kejžar
University of Ljubljana, Faculty of Economics, Department of Statistics, Ljubljana, Slovenia
Simona Korenjak-Černe
University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, Ljubljana, Slovenia
Vladimir Batagelj

Corresponding author

Correspondence to Nataša Kejžar.

Additional information

The authors would like to thank the anonymous referees for many valuable comments and suggestions on how to improve this paper. This work was partially supported by the Slovenian Research Agency, Project J1-6062-0101.

About this article

Cite this article

Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of Distributions: A Case of Patent Citations. J Classif 28, 156–183 (2011). https://doi.org/10.1007/s00357-011-9084-x

Issue Date: July 2011
DOI: https://doi.org/10.1007/s00357-011-9084-x
