Pattern Recognition, Volume 95, November 2019, Pages 235-246

Exact memory–constrained UPGMA for large scale speaker clustering

https://doi.org/10.1016/j.patcog.2019.06.018

Highlights

  • We focus on exact hierarchical clustering of large sets of utterances.

  • Hierarchical clustering is challenging due to memory constraints.

  • We propose an efficient, exact and parallel implementation of UPGMA clustering.

  • We extend the Clustering Features concept to speaker recognition scoring functions.

  • We assess the efficiency of our method on datasets including 4 million utterances.

Abstract

This work focuses on clustering large sets of utterances collected from an unknown number of speakers. Since the number of speakers is unknown, we focus on exact hierarchical agglomerative clustering, followed by automatic selection of the number of clusters. Exact hierarchical clustering of a large number of vectors, however, is a challenging task, because its memory requirements make it ineffective or infeasible for large datasets. We propose an exact, memory–constrained, and parallel implementation of average linkage clustering for large scale datasets, showing that its computational complexity is approximately O(N²), yet it is much faster (up to 40 times in our experiments) than the Reciprocal Nearest Neighbor chain algorithm, which also has O(N²) complexity. We also propose a very fast silhouette computation procedure that determines, in linear time, the set of clusters. The computational efficiency of our approach is demonstrated on datasets including up to 4 million speaker vectors.

Introduction

Clustering is a widely used unsupervised machine learning technique for exploratory data analysis, which allows discovering groups of similar data items [1], [2]. Clustering has a large scope of application in speaker recognition. Diarization of meetings [3], segmentation of multi–speaker conversations [4], [5], [6], and adaptation of a speaker recognition system to a new domain [7] are successful examples of its usage. In these applications, the number of different speakers and turns is relatively small. Clustering of a large number of speakers and speech segments is required, instead, by other applications, such as semi–supervised training of speaker models and speaker indexing in broadcast news. Clustering can also be useful for detecting serial fraudsters in speaker verification applications. Fraudsters of this type challenge the system by performing many calls under different usernames using just their own voice, rather than resorting to more difficult techniques such as replaying or synthesizing the target speaker's voice.

In this work we focus on unsupervised clustering of a large set of utterances collected from an unknown and large number of speakers. Among the most widely used clustering techniques, K–means [8] and spectral clustering [9], [10] require the number of clusters to be predefined, whereas mean-shift [11] relies on the selection of a bandwidth parameter, which cannot be easily chosen and is computationally expensive to estimate for large datasets. Thus, in this work we focus on Hierarchical Agglomerative Clustering (HAC), a popular clustering technique that discovers hierarchical relations between sets of clusters. HAC has long been used in many different fields and applications, for example for vector quantization [12], for document clustering [13], and for creating phylogenetic trees of biological data [14].

We aim at obtaining an exact hierarchical agglomerative solution, rather than relying on approximate Nearest Neighbor (NN) clustering techniques, or on other solutions that take local clustering decisions, based on predefined parameters, without considering all current items and clusters. This allows us to defer the estimation of the number of clusters to the end of the procedure, exploiting an internal validity measure applied to each cluster set that can be extracted from the hierarchy.

In particular, we focus on a computationally efficient and memory–constrained implementation of the so-called Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which progressively merges the clusters with the minimum average distance, producing a dendrogram [15]. UPGMA, or average linkage clustering, has been preferred to single and complete linkage because it gives better performance in terms of cluster and speaker purity [16]. This has been confirmed by our preliminary experiments, not reported here due to space limitations.

Comparison with other clustering approaches that require predefined parameters, such as the number of clusters or thresholds of any kind, is outside the scope of this paper.

In the following, to avoid the distinction between similarity scores and distances, we will mostly refer to similarity scores or simply to scores. The considerations that apply to similarity scores are, of course, also valid for distances or dissimilarity scores.

Exact UPGMA of a large number of vectors is a challenging task due to memory and computational constraints. The standard UPGMA algorithm receives as input a score matrix S over a set of N items, thus it requires O(N²) memory, which is simply not available for large datasets. Its running time is dominated by the time required for the selection of the cluster pairs. By keeping the entries of each row of S in a heap, the UPGMA time complexity reduces to O(N² log N) [13], [17], [18]. It can be further reduced to O(N²) [18] by using the Reciprocal Nearest Neighbor chain (RNN) algorithm [19], [20], [21], which, given a score matrix S, iteratively builds a chain of nearest neighbor clusters until it finds a reciprocal nearest neighbor pair.
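For concreteness, here is a minimal single-threaded sketch of the RNN chain idea for average linkage (our illustration, not the paper's code): a chain of nearest neighbors is grown, and whenever two clusters are reciprocal nearest neighbors they are merged, with the average-linkage Lance–Williams rule keeping the merged distances exact.

```python
import numpy as np

def rnn_chain_upgma(dist):
    """Sketch of UPGMA via the Reciprocal Nearest Neighbor chain.
    'dist' is a symmetric (N x N) distance matrix."""
    dist = dist.astype(float).copy()
    n = dist.shape[0]
    size = np.ones(n)                        # cluster cardinalities
    active = list(range(n))
    chain, merges = [], []
    while len(active) > 1:
        if not chain:
            chain.append(active[0])          # start a new chain anywhere
        top = chain[-1]
        others = [j for j in active if j != top]
        nn = min(others, key=lambda j: dist[top, j])
        if len(chain) >= 2 and nn == chain[-2]:
            a, b = chain.pop(), chain.pop()  # reciprocal pair found: merge
            merges.append((a, b, dist[a, b]))
            for c in active:                 # average-linkage update
                if c != a and c != b:
                    dist[b, c] = dist[c, b] = (
                        size[a] * dist[a, c] + size[b] * dist[b, c]
                    ) / (size[a] + size[b])
            size[b] += size[a]
            active.remove(a)                 # 'b' now represents the merge
        else:
            chain.append(nn)
    return merges
```

Note that the nearest neighbor search at the top of the chain is inherently sequential, which is the limitation discussed next.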

However, if N is large, we can only provide RNN with the item vectors, rather than with the score matrix, which would require O(N²) memory. This solves the memory problem, but since the search for the chain of nearest neighbor clusters is sequential, RNN cannot exploit massively parallel computation. We will further comment on this, and propose an efficient implementation of RNN, after introducing our fast and memory–constrained UPGMA algorithm.

In this work we propose an exact and parallel memory–constrained UPGMA implementation for large scale speaker clustering. Our approach, referred to in the following as k–best UPGMA or K–UPGMA, performs clustering in multiple iterations starting from the singleton clusters. At each iteration, we precompute all the pairwise scores between the current set of clusters. Precomputing blocks of scores in parallel, by multiple threads using vectorized dot products, is much more efficient than sequentially searching for the nearest neighbor of a given cluster. We keep the k–best list of cluster scores, obtained by means of a quick–select algorithm that has linear time complexity even in the worst case [22]. An exact dendrogram is grown by merging a subset of clusters up to a given level implicitly imposed by the size of the k–best list of scores kept in memory. Thus, we trade, to some extent, computation for memory, by recomputing at every iteration the subset of scores that did not contribute to the growth of the dendrogram in the previous iteration. Since this subset shrinks rapidly at every iteration, we experimentally show that, for a large enough size of the k–best list, the overall complexity remains almost quadratic in the number of vectors in the dataset.
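As an illustration of the k–best selection step, selecting the k highest scores from a flat array of candidates can be done in linear time with NumPy's partition (introselect); this sketch is ours, not the paper's implementation:

```python
import numpy as np

def k_best(scores, k):
    """Indices of the k highest scores, found in linear time via
    np.partition (introselect), then sorted best-first for convenience."""
    if len(scores) <= k:
        return np.argsort(scores)[::-1]
    top = np.argpartition(scores, -k)[-k:]   # unordered k-best, O(n)
    return top[np.argsort(scores[top])[::-1]]
```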

K–UPGMA is similar to the approach proposed in [14] in that the clustering process is performed in multiple rounds, but in [14] the matrix of dissimilarities is computed and kept sorted on disk. This is reasonable only if the number of items is not large, and if the dissimilarity computation is expensive, as is possibly the case for clustering protein sequences. This does not fit our case study.

Hierarchical clustering, up to a predefined level of the dendrogram, is also proposed in [23], under the assumption that the higher levels of the dendrogram do not convey interesting information. In this approach, clustering is not performed in multiple rounds because it is assumed that, given a predefined level of the dendrogram, the matrix of precomputed dissimilarities can be stored in the memory of the different nodes of a cluster of computers. In our case, however, the higher levels of the dendrogram do convey interesting information because, even if we cluster millions of utterances, the number of speakers might be limited.

All these approaches, including ours, have at least quadratic complexity. This is unavoidable because they use a global, rather than a local, clustering decision strategy, i.e., they compute all the pairwise scores between the current clusters. However, thanks to efficient data structures and algorithms, and exploiting the possibility of computing the similarity scores between clusters in parallel, we show that K–UPGMA has approximately O(N²) computational complexity, but is up to 40 times faster than RNN. In a real use case, K–UPGMA is able to cluster 900K 400–dimensional speaker vectors in approximately 15 minutes running on a single machine.

Crucial for the clustering efficiency is the possibility of computing the average score between two clusters without evaluating the pairwise distances between all the items of the two clusters. This has been done for cosine or Euclidean scoring functions [24], [25] by associating with each cluster the so-called “Clustering Features”, i.e., the number of vectors in the cluster, and their mean and variance. We extend the Clustering Features concept to a class of scoring functions commonly used in speaker recognition and in statistical classification, by formalizing the scoring functions in terms of dot products.
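As a hint of how this works, assume a scoring function expressible as S(xi, xj) = f(xi)ᵀg(xj); then the average of the |A|·|B| pairwise scores of two clusters A and B collapses to a single dot product of per-cluster sums. A minimal sketch (helper names ours, with cosine scoring of length-normalized vectors as the example):

```python
import numpy as np

def cluster_features(Xc, f, g):
    """Extended Clustering Features of a cluster with member rows Xc:
    the sums of f(x) and g(x) over the cluster, plus its cardinality."""
    return f(Xc).sum(axis=0), g(Xc).sum(axis=0), len(Xc)

def average_score(feat_a, feat_b):
    """Average pairwise score between two clusters as one dot product."""
    fa, _, na = feat_a
    _, gb, nb = feat_b
    return (fa @ gb) / (na * nb)

# Example: cosine scoring, i.e. f = g = identity on normalized vectors
ident = lambda x: x
X = np.random.randn(10, 4)
X /= np.linalg.norm(X, axis=1, keepdims=True)
A = cluster_features(X[:6], ident, ident)
B = cluster_features(X[6:], ident, ident)
exact = np.mean([xi @ xj for xi in X[:6] for xj in X[6:]])
assert np.isclose(average_score(A, B), exact)
```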

Finally, by definition, UPGMA associates with each cluster, represented by a dendrogram entry, the average dissimilarity between its merged clusters. We show that this information can be used to obtain approximate, but accurate, silhouette values [26], [27]. Using the Silhouette Width Criterion (SWC) [27], we can determine with good accuracy the set of clusters in the dataset, and evaluate their quality.

Summarizing, our novel contributions are:

  • A fast, parallel, and exact implementation of UPGMA for large scale vector clustering.

  • An efficient similarity computation method, based on an extension of the “Clustering Features” concept, for scoring functions commonly used in speaker recognition and statistical classification.

  • A fast approximate Silhouette Width Criterion for the selection of the number of clusters.

The outline of the paper is as follows: Section 2 outlines our memory–constrained K–UPGMA. The details of the main data structures and algorithms that we use are given in Section 3, together with the vector–valued functions that allow speeding up pairwise similarity score computation. Section 4 presents a detailed description of K–UPGMA. In Section 5 we give several examples of commonly used scoring functions. Section 6 introduces our fast technique for automatic selection of the number of clusters from the dendrogram produced by UPGMA. Our experimental results are illustrated in Section 7, in terms of the running time of the algorithm and of internal and external cluster purity measures. Conclusions are given in Section 8.

Section snippets

K–UPGMA: outline

The outline of K–UPGMA, ignoring for the moment the details that make it effective, is given in Algorithm 1. In order to fulfill the memory constraints, K–UPGMA receives as input a set of N vectors and computes all their pairwise similarity scores, but keeps in memory only the k–best scoring pairs. It then loops, selecting the current best scoring cluster pair (Ci, Cj), merging their elements into cluster Cm, appending the merge to the dendrogram, and updating the k–best list, until the list is empty. The update
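To make the outline concrete, the following is a simplified, runnable sketch of the multi-round structure only. Cosine similarity of cluster sums stands in for the paper's scoring functions, and pairs invalidated by a merge are simply deferred to the next round's full rescoring, so this sketch omits the exact k-best bookkeeping detailed later (Algorithm 2).

```python
import itertools
import numpy as np

def k_upgma_sketch(X, k):
    """Multi-round sketch: each round rescoring all current cluster
    pairs, keeping only the k best in memory, and merging from that
    list until it is exhausted. X: length-normalized row vectors."""
    clusters, nid = {i: [i] for i in range(len(X))}, len(X)
    dendrogram = []
    while len(clusters) > 1:
        ids = list(clusters)
        pairs = list(itertools.combinations(ids, 2))
        scores = np.array([avg_cos(X, clusters[a], clusters[b])
                           for a, b in pairs])
        best = np.argsort(scores)[::-1][:k]   # k-best pairs this round
        merged = set()
        for i in best:
            a, b = pairs[i]
            if a in merged or b in merged:
                continue                      # stale pair: defer to next round
            clusters[nid] = clusters.pop(a) + clusters.pop(b)
            dendrogram.append((a, b, scores[i]))
            merged.update((a, b))
            nid += 1
    return dendrogram

def avg_cos(X, A, B):
    # average pairwise cosine score via sums (see Clustering Features)
    return (X[A].sum(0) @ X[B].sum(0)) / (len(A) * len(B))
```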

Data structures and scoring functions

Before presenting the details of the K–UPGMA algorithm, let us introduce the main data structures and algorithms that we use. K–UPGMA computes the pairwise similarity scores between N input vectors, corresponding to the elements above the main diagonal of the score matrix. The computational complexity of this step is quadratic. This complexity is amortized by dividing the score matrix into square blocks of size B. A set of T threads, or processes, computes the pairwise scores for these blocks
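A minimal sketch of this blocked, multi-threaded scoring scheme (cosine scores via matrix products; block size B and thread count T as in the text, all helper names ours). NumPy's matrix product releases the GIL inside BLAS, so the tiles genuinely run in parallel:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_scores(X, B=1024, T=8):
    """Yield the B x B tiles of the upper triangle of the score
    matrix, computed in parallel by T threads."""
    n = len(X)
    tiles = [(r, c) for r in range(0, n, B) for c in range(r, n, B)]
    def score_tile(rc):
        r, c = rc
        return rc, X[r:r + B] @ X[c:c + B].T   # vectorized dot products
    with ThreadPoolExecutor(max_workers=T) as pool:
        for (r, c), tile in pool.map(score_tile, tiles):
            # diagonal tiles (r == c) contain self-pairs and duplicates;
            # the caller masks them before quick-selecting the k best
            yield (r, c), tile
```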

K–UPGMA detailed steps

The detailed steps of K–UPGMA are summarized in Algorithm 2. Each singleton cluster is associated with a representation vector initialized according to Section 5. T threads (or processes) are created in a single node, or in multiple computation nodes, to compute in concurrency all the pairwise scores of the vector set. Each thread computes the scores of a block B of vectors, and quick–selects the k–best scores for that block. The T individual k–best lists are then merged (using again
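A sketch of that merge step, under the assumption that the final global selection reuses the same linear-time quick-select (np.partition here):

```python
import numpy as np

def merge_k_best(score_lists, pair_lists, k):
    """Reduce the T per-block k-best lists (at most T*k candidates)
    to a single global k-best list, sorted best-first."""
    scores = np.concatenate(score_lists)
    pairs = np.concatenate(pair_lists)       # shape (m, 2) index pairs
    if len(scores) > k:
        top = np.argpartition(scores, -k)[-k:]   # linear-time selection
        scores, pairs = scores[top], pairs[top]
    order = np.argsort(scores)[::-1]         # sort only the k survivors
    return scores[order], pairs[order]
```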

Examples of scoring functions

An equivalent formulation of the scoring function (4) is:

$$S(x_i, x_j) = f(x_i)^T g(x_j) + h(x_i) + h(x_j) \qquad (9)$$

where $h$ is a scalar–valued function. By setting $f = \bar{f}$, $g = \bar{g}$, and $h(x) = 0$, (9) becomes (4), and by setting

$$\bar{f}(x_i) = [f(x_i), h(x_i), 1]^T, \qquad \bar{g}(x_j) = [g(x_j), 1, h(x_j)]^T$$

their dot product (4) is equivalent to (9).
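A quick numeric check of this equivalence (f, g and h chosen arbitrarily for illustration):

```python
import numpy as np

# arbitrary illustrative choices of f, g and a scalar-valued h
f = lambda x: 2.0 * x
g = lambda x: x + 1.0
h = lambda x: float(x @ x)

def f_bar(x):  # augmented vector [f(x), h(x), 1]
    return np.concatenate([f(x), [h(x), 1.0]])

def g_bar(x):  # augmented vector [g(x), 1, h(x)]
    return np.concatenate([g(x), [1.0, h(x)]])

xi, xj = np.random.randn(4), np.random.randn(4)
s9 = f(xi) @ g(xj) + h(xi) + h(xj)   # form (9)
s4 = f_bar(xi) @ g_bar(xj)           # pure dot-product form (4)
assert np.isclose(s9, s4)
```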

We here show that scoring functions commonly used for clustering can be represented according to (9), and that the dimension of f(x) and g(x) is equal, or close, to the original vector dimension.

We first formulate as in (9)

Silhouette width criterion

A further advantage of using UPGMA is that it is possible to devise a fast technique for computing an approximate Silhouette Width Criterion (SWC) curve [27], which allows us to estimate the number of speakers in large datasets and to evaluate the quality of the corresponding clusters.
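For reference, a direct quadratic-time implementation of the criterion that this fast procedure approximates might look as follows (assuming the full distance matrix fits in memory, which is exactly what the approximation avoids, and at least two clusters):

```python
import numpy as np

def silhouette_width(dist, labels):
    """Exact SWC: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i)
    is the mean distance of item i to its own cluster and b(i) the
    mean distance to the nearest other cluster; returns the average."""
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                          # singleton: s(i) = 0
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()   # nearest other cluster
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```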

SWC was selected based on the results of the comparison of 40 internal clustering validity measures performed in [27], which shows that SWC is not only among the best techniques in relative terms, but it is also

Experiments

This section presents the results of a set of experiments performed on Nuance servers, on a set of pre-extracted e–vectors [36] from an anonymized text-dependent dataset of passphrases from phone calls lasting 5 seconds on average. All speakers pronounce the same passphrase. This dataset includes 900K length–normalized e–vectors of dimension d=400.

A subset of these vectors, consisting of 350K vectors from 83,123 speakers, is labeled. The average number of utterances per speaker for the

Conclusions

We introduced a fast and exact implementation of UPGMA for large scale vector clustering, and a fast approximate Silhouette Width Criterion, which allows us to estimate with good accuracy the number of clusters in a large dataset.

We also described a class of scoring functions that allows the similarity score of two clusters to be obtained by means of a single dot product of the representation vectors associated with the clusters.

A large set of experiments has been performed to evaluate the scalability

Conflict of interest

None.


References (45)

  • E. Khoury et al., Hierarchical speaker clustering methods for the NIST i-vector challenge, in: Odyssey: The Speaker and Language Recognition Workshop, 2014.

  • D. Garcia-Romero et al., Unsupervised domain adaptation for i-vector speaker recognition, in: Proc. of Odyssey 2014, The Speaker and Language Recognition Workshop, 2014.

  • J. Hartigan et al., A k-means clustering algorithm, J. R. Stat. Soc. Ser. C, 1979.

  • A. Ng et al., On spectral clustering: analysis and an algorithm, in: Proc. of Neural Information Processing Systems: Natural and Synthetic, 2001.

  • U. Luxburg, A tutorial on spectral clustering, Stat. Comput., 2007.

  • D. Comaniciu et al., Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell., 2002.

  • P. Franti et al., Fast and memory efficient implementation of the exact PNN, IEEE Trans. Image Process., 2000.

  • Y. Zhao et al., Evaluation of hierarchical clustering algorithms for document datasets, in: Proc. of the Eleventh International Conference on Information and Knowledge Management, 2002.

  • Y. Loewenstein et al., Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space, Bioinformatics, 2008.

  • B. Everitt et al., Cluster Analysis, 2011.

  • G. Sell et al., Speaker diarization with PLDA i-vector scoring and unsupervised calibration, in: 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.

  • W. Day et al., Efficient algorithms for agglomerative hierarchical clustering methods, J. Classification, 1984.
Sandro Cumani received the Ph.D. in computer and system engineering from Politecnico di Torino, Italy, in 2011. He is a Research Fellow in the Department of Control and Computer Engineering, Politecnico di Torino. His current research interests include machine learning, speech processing and biometrics, in particular speaker and language recognition.

Pietro Laface is full Professor of Computer Science at Politecnico di Torino, Italy, where he leads the speech technology research group. He has published over 150 papers in the area of pattern recognition, artificial intelligence, and spoken language processing. His research interests include all aspects of automatic speech recognition and its applications.