
Signature test as statistical testing in clustering

  • Original Paper

Signal, Image and Video Processing

Abstract

We propose a new statistical test, denoted signature testing (Sigtest), with applications in clustering and image classification. Sigtest relies on probabilistic validation of the empirical distribution function of the data. We implement Sigtest to estimate the number of clusters in hierarchical and partitional clustering. In addition, we propose a new adaptive estimation of the vocabulary size in image classification. Simulation results on both real and synthetic data confirm the superiority of Sigtest over existing statistical tests in both hierarchical and partitional clustering, as it estimates the number of clusters more accurately. Sigtest is also advantageous in terms of adjusted Rand index and variation of information. In addition, using Sigtest for the adaptive choice of vocabulary size in bag of visual words improves the efficiency of the SVM classifier and reduces the time complexity of the overall algorithm.


Notes

  1. Based on the Chebyshev inequality, for the confidence probability \(p_{c}\):

    $$\begin{aligned} Pr\{|g(V,z)-E[g(V,z)] |\le J\}=p_{c} \end{aligned}$$
    (7)

    we have \(J=\alpha \sqrt{var[g(V,z)]}\), which leads to \(\alpha \le \sqrt{\frac{1}{1-p_{c}}}\), the upper limit of \(\alpha \); for example, \(p_{c}=0.99\) gives \(\alpha \le 10\).

  2. Details for calculating \(F_{aj}(z)\) are provided in “Appendix 1”.

  3. The optimum choice of \(\alpha \) and T has improved the results of Sigtest compared to [6] and [9].

References

  1. Shahbaba, M., Beheshti, S.: MACE-means clustering. Signal Process. 105, 216–225 (2014)


  2. Hamerly, G., Elkan, C.: Learning the k in k-means. In: Neural Information Processing Systems, vol. 16, p. 281. MIT Press, Cambridge (2003)

  3. Hamerly, G., Feng, Y.: PG-means: learning the number of clusters in data. In: Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference, vol. 19, p. 393. MIT Press, Cambridge (2007)

  4. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)

  5. Beheshti, S., Hashemi, M., Zhang, X.P., Nikvand, N.: Noise invalidation denoising. IEEE Trans. Signal Process. 58(12), 6007–6016 (2010)


  6. Shahbaba, M., Beheshti, S.: Efficient unimodality test in clustering by signature testing. arXiv:1401.1895 [cs.LG] (2014)

  7. Shalizi, C.R.: Advanced Data Analysis from an Elementary Point of View, Ch. 19, Sec. 4, pp. 378–384. http://www.stat.cmu.edu/cshalizi/uADA/12/ (2012)

  8. Shahbaba, M., Beheshti, S.: Improving X-means clustering with MNDL. In: 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), pp. 1298–1302. IEEE (2012)

  9. Shahbaba, M., Beheshti, S.: Model verification of GMM clustering based on signature testing. In: Proceedings of 27th Canadian Conference on Electrical and Computer Engineering (CCECE) (2014) (to appear)

  10. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: IEEE, CVPR 2004, Workshop on Generative-Model Based Vision (2004)

  11. Cai, D., He, X., Han, J., Huang, T.: Graph regularized non-negative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1548–1560 (2011)


  12. Cai, D., He, X., Han, J.: Speed up kernel discriminant analysis. VLDB J. 20(1), 21–33 (2011)


  13. Kalogeratos, A., Likas, A.: Dip-means: an incremental clustering method for estimating the number of clusters. In: Advances in Neural Information Processing Systems, pp. 2393–2401. IEEE, New York (2012)

  14. Meilă, M.: Comparing clusterings: an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)


  15. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)



Author information


Corresponding author

Correspondence to Mahdi Shahbaba.

Appendices

Appendix 1: Folded normal distribution

Let \(\theta \) be a sample of a standard Gaussian distribution, \(\theta \sim \mathcal {N}(0,1)\), with density function \(\phi (\theta )\) as follows:

$$\begin{aligned} \phi (\theta ) = \frac{1}{\sqrt{2\pi }}e^{-\theta ^{2}/2}, \quad \theta \in R \end{aligned}$$
(11)

Therefore, the distribution function of \(\theta \) can be given as follows:

$$\begin{aligned} \varPhi (\theta ) = \int _{-\infty }^{\theta } \phi (v) \hbox {d}v = \int _{-\infty }^{\theta } \frac{1}{\sqrt{2\pi }}e^{-v^{2}/2} \hbox {d}v, \theta \in R \end{aligned}$$
(12)

We let \(V\) be a sample of a Gaussian distribution with mean \(\mu \) and standard deviation \(\sigma \), \(V \sim \mathcal {N}(\mu ,\sigma )\). Therefore, the random variable V can be written as \(V = \mu + \sigma \varTheta \), where \(\varTheta \sim \mathcal {N}(0,1)\). It follows that \(W = |V| = |\mu + \sigma \varTheta |\) is a random variable with a folded normal distribution, in which all of the negative values of V are folded onto the positive region of the distribution.

Consequently, the cdf of the sorted sample \(w\), where \(w\in [0,\infty )\), is as follows:

$$\begin{aligned} F_{a}(w)= & {} P(W\le w) = P(|V| \le w) = P(|\mu +\sigma \varTheta |\le w) \nonumber \\= & {} P(-w \le \mu +\sigma \varTheta \le w) \nonumber \\= & {} P\left( \frac{-w-\mu }{\sigma } \le \varTheta \le \frac{w - \mu }{\sigma }\right) \nonumber \\= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) - \varPhi \left( \frac{-w-\mu }{\sigma }\right) \nonumber \\ \end{aligned}$$
(13)

Since \(\varPhi (-\theta )=1-\varPhi (\theta )\), we will have:

$$\begin{aligned} F_{a}(w)= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) - \varPhi \left( \frac{-w-\mu }{\sigma }\right) \nonumber \\= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) + \varPhi \left( \frac{w+\mu }{\sigma }\right) - 1 \nonumber \\= & {} \int _{0}^{w}\frac{1}{\sigma \sqrt{2\pi }} \left\{ \exp {\left[ -\frac{1}{2}\left( \frac{v+\mu }{\sigma }\right) ^{2}\right] }\right. \nonumber \\&\left. +\exp {\left[ -\frac{1}{2}\left( \frac{v-\mu }{\sigma }\right) ^{2}\right] }\right\} \hbox {d}v\nonumber \\ \end{aligned}$$
(14)

In the case of Gaussian mixture models, the above cdf is calculated as follows [7]:

$$\begin{aligned} F_{a}(z)=\sum _{j=1}^{K}\pi _{j} F_{aj}(z) \end{aligned}$$
(15)

where \(F_{aj}(z)\) is the Gaussian cdf of the jth component and \(\pi _{j}\) is the mixing factor of that component in the mixture.
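As a quick illustration, the following minimal Python sketch evaluates the folded normal cdf in (14) and the mixture cdf in (15) with scipy.stats.norm. It assumes each component cdf \(F_{aj}\) takes the folded normal form derived above; the parameter values in the example are placeholders rather than values from the paper.

```python
import numpy as np
from scipy.stats import norm

def folded_normal_cdf(w, mu, sigma):
    """F_a(w) in (14): Phi((w - mu)/sigma) + Phi((w + mu)/sigma) - 1 for w >= 0."""
    w = np.asarray(w, dtype=float)
    cdf = norm.cdf((w - mu) / sigma) + norm.cdf((w + mu) / sigma) - 1.0
    return np.where(w < 0.0, 0.0, cdf)   # the folded variable has no mass below zero

def mixture_cdf(z, mus, sigmas, weights):
    """F_a(z) in (15): weighted sum of the component cdfs F_aj(z)."""
    return sum(pi * folded_normal_cdf(z, mu, sigma)
               for pi, mu, sigma in zip(weights, mus, sigmas))

# Illustrative values only (not taken from the paper)
z = np.linspace(0.0, 4.0, 9)
print(folded_normal_cdf(z, mu=1.0, sigma=0.5))
print(mixture_cdf(z, mus=[0.0, 2.0], sigmas=[1.0, 0.5], weights=[0.4, 0.6]))
```

Because \(\varPhi \) is available in closed form via the error function, the integral in (14) does not need to be evaluated numerically.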

Fig. 8: Estimated \(\alpha \) and T parameters using the genetic algorithm for different distances between clusters

Appendix 2: Estimation of \(\alpha \) and T

For each combination of \(\alpha \) and T, the similarity score \(A(\alpha , T)\) can be calculated using (9):

$$\begin{aligned} A(\alpha , T) = {\left\{ \begin{array}{ll} 1 \quad Sigtest_{score}(\alpha ) < T (H_{0})\\ 0 \quad Sigtest_{score}(\alpha ) \ge T (H_{1}) \end{array}\right. } \end{aligned}$$
(16)

Consider \(A(\alpha , T)\) for two different scenarios: (1) \(A_{0}(\alpha , T)\), where the ecdf of the data corresponds to a single cluster and the Dcdf is a single Gaussian as well; (2) \(A_{1}(\alpha , T)\), where the ecdf of the data corresponds to two overlapping clusters and the Dcdf is a single Gaussian. A proper combination of \(\alpha \) and T should suggest splitting the two clusters and should not split the single cluster.

The following minimization problem can be used to estimate the proper \(\alpha \) and T:

$$\begin{aligned}{}[\hat{\alpha }, \hat{T}] = \mathop {\arg \min }\limits _{\alpha , T}[A_{1}(\alpha , T) - A_{0}(\alpha , T)] \end{aligned}$$
(17)

where \(0<T\le 1\) and \(0<\alpha \le \sqrt{\frac{1}{1-p_{c}}}\) [i.e., \(p_{c} = 0.99\) in (6)] are the constraints for the minimization problem. To solve (17), we have employed a genetic algorithm (GA) with a population size of 100. Figure 8 shows the result of the GA simulations for estimating \(\alpha \) and T for different distances between the centers of the clusters. Consequently, to stay in a safe range for the optimum behavior of Sigtest, we set \(\alpha \) and T to the averaged values of 1.72 and 0.53, respectively. In the case of Gaussian mixture models, the value of T is adaptively calculated based on the assumed mixing factor \(\pi _{j}\) in (15). As a result, the T value in a mixture model is given as follows:

$$\begin{aligned} T = 0.53 \max _{j}{\pi _{j}} \end{aligned}$$
(18)
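For illustration, here is a minimal Python sketch of the decision rule in (16), the constraint \(0<\alpha \le \sqrt{1/(1-p_{c})}\), and the adaptive threshold in (18). The coarse grid search stands in for the genetic algorithm used above, and the score functions (sigtest_score, score_single, score_two) are assumed to be supplied by the Sigtest procedure described in the body of the paper; they are not reproduced here.

```python
import numpy as np

def similarity_score(sigtest_score, alpha, T):
    """A(alpha, T) in (16): 1 keeps H0 (do not split), 0 keeps H1 (split)."""
    return 1 if sigtest_score(alpha) < T else 0

def adaptive_T(mixing_weights, base_T=0.53):
    """Adaptive threshold for mixture models as in (18): T = 0.53 * max_j pi_j."""
    return base_T * max(mixing_weights)

def estimate_alpha_T(score_single, score_two, p_c=0.99, grid=50):
    """Coarse grid search over the feasible region of (17); a stand-in for the GA.

    score_single / score_two are Sigtest score functions computed on data drawn
    from a single cluster and from two overlapping clusters, respectively."""
    alpha_max = np.sqrt(1.0 / (1.0 - p_c))        # upper limit from the Chebyshev bound
    best_pair, best_val = None, np.inf
    for alpha in np.linspace(1e-3, alpha_max, grid):
        for T in np.linspace(1e-3, 1.0, grid):
            # objective of (17): A1 should be 0 (split) while A0 should be 1 (no split)
            val = (similarity_score(score_two, alpha, T)
                   - similarity_score(score_single, alpha, T))
            if val < best_val:
                best_pair, best_val = (alpha, T), val
    return best_pair
```

In practice the two score functions would be averaged over many synthetic realizations of the single-cluster and two-cluster scenarios, as in the GA experiments summarized in Fig. 8; the sketch only illustrates the feasible region and the objective.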


Cite this article

Shahbaba, M., Beheshti, S. Signature test as statistical testing in clustering. SIViP 10, 1343–1351 (2016). https://doi.org/10.1007/s11760-016-0926-1
