
Signature test as statistical testing in clustering

  • Original Paper

Signal, Image and Video Processing

Abstract

We propose a new statistical test, denoted signature testing (Sigtest), with applications in clustering and image classification. Sigtest relies on probabilistic validation of the empirical distribution function of the data. We implement Sigtest to estimate the number of clusters in hierarchical and partitional clustering. In addition, we propose a new adaptive estimation of the vocabulary size in image classification. Simulation results on both real and synthetic data confirm the superiority of Sigtest over existing statistical tests in both hierarchical and partitional clustering, as it estimates the number of clusters more accurately. Sigtest is also advantageous in terms of adjusted Rand index and variation of information. In addition, using Sigtest for the adaptive choice of vocabulary size in bag of visual words improves the efficiency of the SVM classifier and reduces the time complexity of the overall algorithm.


Notes

  1. Based on the Chebyshev inequality, for the confidence probability \(p_{c}\):

    $$\begin{aligned} Pr\{|g(V,z)-E[g(V,z)] |\le J\}=p_{c} \end{aligned}$$
    (7)

    we have \(J=\alpha \sqrt{var[g(V,z)]}\), which leads to \(\alpha \le \sqrt{\frac{1}{1-p_{c}}}\), the upper limit of \(\alpha \); for example, \(p_{c}=0.99\) gives \(\alpha \le 10\).

  2. Details for calculating \(F_{aj}(z)\) are provided in “Appendix 1”.

  3. The optimum choice of \(\alpha \) and T has improved the results of Sigtest compared to [6] and [9].

References

  1. Shahbaba, M., Beheshti, S.: MACE-means clustering. Signal Process. 105, 216–225 (2014)


  2. Hamerly, G., Elkan, C.: Learning the k in k-means. In: Neural Information Processing Systems, vol. 16, p. 281. MIT Press, Cambridge (2003)

  3. Hamerly, G., Feng, Y.: PG-means: learning the number of clusters in data. In: Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference, vol. 19, p. 393. MIT Press, Cambridge (2007)

  4. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)

  5. Beheshti, S., Hashemi, M., Zhang, X.P., Nikvand, N.: Noise invalidation denoising. IEEE Trans. Signal Process. 58(12), 6007–6016 (2010)


  6. Shahbaba, M., Beheshti, S.: Efficient unimodality test in clustering by signature testing. arXiv:1401.1895 [cs.LG] (2014)

  7. Shalizi, C.R.: Advanced Data Analysis from an Elementary Point of View, Ch. 19, Sec. 4, pp. 378–384. http://www.stat.cmu.edu/cshalizi/uADA/12/ (2012)

  8. Shahbaba, M., Beheshti, S.: Improving X-means clustering with MNDL. In: 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), pp. 1298–1302. IEEE (2012)

  9. Shahbaba, M., Beheshti, S.: Model verification of GMM clustering based on signature testing. In: Proceedings of 27th Canadian Conference on Electrical and Computer Engineering (CCECE) (2014) (to appear)

  10. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: IEEE, CVPR 2004, Workshop on Generative-Model Based Vision (2004)

  11. Cai, D., He, X., Han, J., Huang, T.: Graph regularized non-negative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1548–1560 (2011)


  12. Cai, D., He, X., Han, J.: Speed up kernel discriminant analysis. VLDB J. 20(1), 21–33 (2011)


  13. Kalogeratos, A., Likas, A.: Dip-means: an incremental clustering method for estimating the number of clusters. In: Advances in Neural Information Processing Systems, pp. 2393–2401. IEEE, New York (2012)

  14. Meilă, M.: Comparing clusterings: an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)


  15. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)



Author information


Corresponding author

Correspondence to Mahdi Shahbaba.

Appendices

Appendix 1: Folded normal distribution

Let \(\theta \) be a sample of a standard Gaussian distribution, \(\theta \sim \mathcal {N}(0,1)\), with density function \(\phi (\theta )\) as follows:

$$\begin{aligned} \phi (\theta ) = \frac{1}{\sqrt{2\pi }}e^{-\theta ^{2}/2}, \quad \theta \in R \end{aligned}$$
(11)

Therefore, the distribution function of \(\theta \) can be given as follows:

$$\begin{aligned} \varPhi (\theta ) = \int _{-\infty }^{\theta } \phi (v) \hbox {d}v = \int _{-\infty }^{\theta } \frac{1}{\sqrt{2\pi }}e^{-v^{2}/2} \hbox {d}v, \theta \in R \end{aligned}$$
(12)

We let \(V\) be a sample of a Gaussian distribution with mean \(\mu \) and standard deviation \(\sigma \), \(V \sim \mathcal {N}(\mu ,\sigma )\). Therefore, the random variable V can be written as \(V = \mu + \sigma \varTheta \), where \(\varTheta \sim \mathcal {N}(0,1)\). It follows that \(W = |V| = |\mu + \sigma \varTheta |\) is a random variable with a folded normal distribution, in which all of the negative values of V are folded onto the positive region of the distribution.

Consequently, the cdf of the sorted sample \(w\), where \(w\in [0,\infty )\), is as follows:

$$\begin{aligned} F_{a}(w)= & {} P(W\le w) = P(|V| \le w) = P(|\mu +\sigma \varTheta |\le w) \nonumber \\= & {} P(-w \le \mu +\sigma \varTheta \le w) \nonumber \\= & {} P\left( \frac{-w-\mu }{\sigma } \le \varTheta \le \frac{w - \mu }{\sigma }\right) \nonumber \\= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) - \varPhi \left( \frac{-w-\mu }{\sigma }\right) \nonumber \\ \end{aligned}$$
(13)

Since \(\varPhi (-\theta )=1-\varPhi (\theta )\), we will have:

$$\begin{aligned} F_{a}(w)= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) - \varPhi \left( \frac{-w-\mu }{\sigma }\right) \nonumber \\= & {} \varPhi \left( \frac{w-\mu }{\sigma }\right) + \varPhi \left( \frac{w+\mu }{\sigma }\right) - 1 \nonumber \\= & {} \int _{0}^{w}\frac{1}{\sigma \sqrt{2\pi }} \left\{ \exp {\left[ -\frac{1}{2}\left( \frac{v+\mu }{\sigma }\right) ^{2}\right] }\right. \nonumber \\&\left. +\exp {\left[ -\frac{1}{2}\left( \frac{v-\mu }{\sigma }\right) ^{2}\right] }\right\} \hbox {d}v\nonumber \\ \end{aligned}$$
(14)

In the case of Gaussian mixture models, the above cdf is calculated as follows [7]:

$$\begin{aligned} F_{a}(z)=\sum _{j=1}^{K}\pi _{j} F_{aj}(z) \end{aligned}$$
(15)

where \(F_{aj}(z)\) is the Gaussian cdf of the jth component and \(\pi _{j}\) is the mixing factor of that component in the mixture.
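As a quick illustration, the following minimal Python sketch evaluates the folded normal cdf in (14) and the mixture cdf in (15) with scipy.stats.norm. It assumes each component cdf \(F_{aj}\) takes the folded normal form derived above; the parameter values in the example are placeholders rather than values from the paper.

```python
import numpy as np
from scipy.stats import norm

def folded_normal_cdf(w, mu, sigma):
    """F_a(w) in (14): Phi((w - mu)/sigma) + Phi((w + mu)/sigma) - 1 for w >= 0."""
    w = np.asarray(w, dtype=float)
    cdf = norm.cdf((w - mu) / sigma) + norm.cdf((w + mu) / sigma) - 1.0
    return np.where(w < 0.0, 0.0, cdf)   # the folded variable has no mass below zero

def mixture_cdf(z, mus, sigmas, weights):
    """F_a(z) in (15): weighted sum of the component cdfs F_aj(z)."""
    return sum(pi * folded_normal_cdf(z, mu, sigma)
               for pi, mu, sigma in zip(weights, mus, sigmas))

# Illustrative values only (not taken from the paper)
z = np.linspace(0.0, 4.0, 9)
print(folded_normal_cdf(z, mu=1.0, sigma=0.5))
print(mixture_cdf(z, mus=[0.0, 2.0], sigmas=[1.0, 0.5], weights=[0.4, 0.6]))
```

Because \(\varPhi \) is available in closed form via the error function, the integral in (14) does not need to be evaluated numerically.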

Fig. 8: Estimated \(\alpha \) and T parameters using the genetic algorithm for different distances between clusters

Appendix 2: Estimation of \(\alpha \) and T

For each combination of \(\alpha \) and T, the similarity score \(A(\alpha , T)\) can be calculated using (9):

$$\begin{aligned} A(\alpha , T) = {\left\{ \begin{array}{ll} 1 \quad Sigtest_{score}(\alpha ) < T (H_{0})\\ 0 \quad Sigtest_{score}(\alpha ) \ge T (H_{1}) \end{array}\right. } \end{aligned}$$
(16)

Consider \(A(\alpha , T)\) for two different scenarios: (1) \(A_{0}(\alpha , T)\), where the ecdf of the data corresponds to a single cluster and the Dcdf is a single Gaussian as well; (2) \(A_{1}(\alpha , T)\), where the ecdf of the data corresponds to two overlapping clusters and the Dcdf is a single Gaussian. A proper combination of \(\alpha \) and T should suggest splitting the two clusters and should not split the single cluster.

The following minimization problem can be used to estimate the proper \(\alpha \) and T:

$$\begin{aligned}{}[\hat{\alpha }, \hat{T}] = \mathop {\arg \min }\limits _{\alpha , T}[A_{1}(\alpha , T) - A_{0}(\alpha , T)] \end{aligned}$$
(17)

where \(0<T\le 1\) and \(0<\alpha \le \sqrt{\frac{1}{1-p_{c}}}\) [i.e., \(p_{c} = 0.99\) in (6)] are the constraints for the minimization problem. To solve (17), we have employed a genetic algorithm (GA) with a population size of 100. Figure 8 shows the result of the GA simulations for estimating \(\alpha \) and T for different distances between the centers of the clusters. Consequently, to stay in a safe range for the optimum behavior of Sigtest, we set \(\alpha \) and T to the averaged values of 1.72 and 0.53, respectively. In the case of Gaussian mixture models, the value of T is adaptively calculated based on the assumed mixing factor \(\pi _{j}\) in (15). As a result, the T value in a mixture model is given as follows:

$$\begin{aligned} T = 0.53 \max _{j}{\pi _{j}} \end{aligned}$$
(18)
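For illustration, here is a minimal Python sketch of the decision rule in (16), the constraint \(0<\alpha \le \sqrt{1/(1-p_{c})}\), and the adaptive threshold in (18). The coarse grid search stands in for the genetic algorithm used above, and the score functions (sigtest_score, score_single, score_two) are assumed to be supplied by the Sigtest procedure described in the body of the paper; they are not reproduced here.

```python
import numpy as np

def similarity_score(sigtest_score, alpha, T):
    """A(alpha, T) in (16): 1 keeps H0 (do not split), 0 keeps H1 (split)."""
    return 1 if sigtest_score(alpha) < T else 0

def adaptive_T(mixing_weights, base_T=0.53):
    """Adaptive threshold for mixture models as in (18): T = 0.53 * max_j pi_j."""
    return base_T * max(mixing_weights)

def estimate_alpha_T(score_single, score_two, p_c=0.99, grid=50):
    """Coarse grid search over the feasible region of (17); a stand-in for the GA.

    score_single / score_two are Sigtest score functions computed on data drawn
    from a single cluster and from two overlapping clusters, respectively."""
    alpha_max = np.sqrt(1.0 / (1.0 - p_c))        # upper limit from the Chebyshev bound
    best_pair, best_val = None, np.inf
    for alpha in np.linspace(1e-3, alpha_max, grid):
        for T in np.linspace(1e-3, 1.0, grid):
            # objective of (17): A1 should be 0 (split) while A0 should be 1 (no split)
            val = (similarity_score(score_two, alpha, T)
                   - similarity_score(score_single, alpha, T))
            if val < best_val:
                best_pair, best_val = (alpha, T), val
    return best_pair
```

In practice the two score functions would be averaged over many synthetic realizations of the single-cluster and two-cluster scenarios, as in the GA experiments summarized in Fig. 8; the sketch only illustrates the feasible region and the objective.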


Cite this article

Shahbaba, M., Beheshti, S. Signature test as statistical testing in clustering. SIViP 10, 1343–1351 (2016). https://doi.org/10.1007/s11760-016-0926-1
