Abstract
We propose a novel approach, the dynamic affine-invariant shape-appearance model (Aff-SAM), and employ it for handshape classification and sign recognition in sign language (SL) videos. Aff-SAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and the extraction of handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving the model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness to occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level that does not require training a new signer-specific model from scratch; instead, we employ a short development data set to adapt the models to a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements in handshape classification compared to other feature extraction approaches. Supplementary sign recognition experiments are conducted on a multi-signer, 100-sign data set from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as the signer adaptation of Aff-SAM to multiple signers, providing promising results.
Editors: Isabelle Guyon and Vassilis Athitsos
Notes
1. The existence of two such transforms is due to the modulo-\(\pi \) ambiguity of the orientation.
2. We have assumed that the parameters \(\varvec{\lambda }\) and \(\varvec{p}\) are statistically independent.
3. The \((K+1)\)-plets follow from the fact that we need K neighbouring samples + the current sample.
References
U. Agris, J. Zieren, U. Canzler, B. Bauer, K.F. Kraiss, Recent developments in visual sign language recognition. Univ. Access Inf. Soc. 6, 323–362 (2008)
T. Ahmad, C.J. Taylor, A. Lanitis, T.F. Cootes, Tracking and recognising hand gestures, using statistical shape models. Image Vis. Comput. 15(5), 345–352 (1997)
A. Argyros, M. Lourakis, Real time tracking of multiple skin-colored objects with a possibly moving camera, in Proceedings of the European Conference on Computer Vision, 2004
V. Athitsos, S. Sclaroff, An appearance-based framework for 3d hand shape classification and camera viewpoint estimation, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2002, pp. 45–52
S. Baker, I. Matthews, Lucas-kanade 20 years on: a unifying framework: Part 1. Technical report, Carnegie Mellon University, 2002
S. Baker, R. Gross, I. Matthews, Lucas-kanade 20 years on: a unifying framework: Part 4, Technical report, Carnegie Mellon University, 2004
B. Bauer, K.F. Kraiss, Towards an automatic sign language recognition system using subunits, in Proceedings of the International Gesture Workshop vol. 2298, 2001, pp. 64–75
H. Birk, T.B. Moeslund, C.B. Madsen, Real-time recognition of hand alphabet gestures using principal component analysis, in Proceedings of the Scandinavian Conference Image Analysis, 1997
A. Blake, M. Isard, Active Contours (Springer, 1998)
R. Bowden, M. Sarhadi, A nonlinear model of shape and motion for tracking fingerspelt american sign language. Image Vis. Comput. 20, 597–607 (2002)
P. Buehler, M. Everingham, A. Zisserman, Learning sign language by watching TV (using weakly aligned subtitles), in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2009
J. Cai, A. Goshtasby, Detecting human faces in color images. Image Vis. Comput. 18, 63–75 (1999)
F.-S. Chen, C.-M. Fu, C.-L. Huang, Hand gesture recognition using a real-time tracking method and hidden markov models. Image Vis. Comput. 21(8), 745–758 (2003)
S. Conseil, S. Bourennane, L. Martin, Comparison of Fourier descriptors and Hu moments for hand posture recognition, in Proceedings of the European Conference on Signal Processing, 2007
T.F. Cootes, C.J. Taylor, Statistical models of appearance for computer vision. Technical report, University of Manchester, 2004
Y. Cui, J. Weng, Appearance-based hand sign recognition from intensity image sequences. Comput. Vis. Image Underst. 78(2), 157–176 (2000)
D.L. Davies, D.W. Bouldin, A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979)
J.-W. Deng, H.T. Tsui, A novel two-layer PCA/MDA scheme for hand posture recognition. Proc. Int. Conf. Pattern Recognit. 1, 283–286 (2002)
DictaSign, Greek sign language corpus, http://www.sign-lang.uni-hamburg.de/dicta-sign/portal, 2012
L. Ding, A.M. Martinez, Modelling and recognition of the linguistic components in american sign language. Image Vis. Comput. 27(12), 1826–1844 (2009)
P. Dreuw, J. Forster, T. Deselaers, H. Ney, Efficient approximations to model-based joint tracking and recognition of continuous sign language, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2008
I.L. Dryden, K.V. Mardia, Statistical Shape Analysis (Wiley, 1998)
W. Du, J. Piater, Hand modeling and tracking for video-based sign language recognition by robust principal component analysis, in Proceedings of the ECCV Workshop on Sign, Gesture and Activity, 2010
A. Farhadi, D. Forsyth, R. White, Transfer learning in sign language, in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2007), pp. 1–8
H. Fillbrandt, S. Akyol, K.-F. Kraiss, Extraction of 3D hand shape and posture from image sequences for sign language recognition, in Proceedings of the International Workshop on Analysis and Modeling of Faces and Gestures, 2003, pp. 181–186
R. Gross, I. Matthews, S. Baker, Generic vs. person specific active appearance models. Image Vis. Comput. 23(12), 1080–1093 (2005)
T. Hanke, HamNoSys Representing sign language data in language resources and language processing contexts, in Proceedings of the International Conference on Language Resources and Evaluation, 2004
M.-K. Hu, Visual pattern recognition by moment invariants. IEEE Trans. Inf. Theory 8(2), 179–187 (1962)
C.-L. Huang, S.-H. Jeng, A model-based hand gesture recognition system. Mach. Vis. Appl. 12(5), 243–258 (2001)
P. Kakumanu, S. Makrogiannis, N. Bourbakis, A survey of skin-color modeling and detection methods. Pattern Recogn. 40(3), 1106–1122 (2007)
E. Learned-Miller, Data driven image models through continuous joint alignment. IEEE Trans. Pattern Anal. Mach. Intell. 28(2), 236–250 (2005)
S. Liwicki, M. Everingham, Automatic recognition of fingerspelled words in British sign language, in Proceedings of the CVPR Workshop on Human Communicative Behavior Analysis, 2009
P. Maragos, Morphological Filtering for Image Enhancement and Feature Detection, The Image and Video Processing Handbook (Elsevier, 2005)
I. Matthews, S. Baker, Active appearance models revisited. Int. J. Comput. Vis. 60(2), 135–164 (2004)
C. Neidle, Signstream annotation: addendum to conventions used for the american sign language linguistic research project. Technical report, 2007
C. Neidle, C. Vogler, A new web interface to facilitate access to corpora: development of the ASLLRP data access interface, in Proceedings of the International Conference on Language Resources and Evaluation, 2012
E.J. Ong, H. Cooper, N. Pugeault, R. Bowden, Sign language recognition using sequential pattern trees, in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 2200–2207
Y. Peng, A. Ganesh, J. Wright, W. Xu, Y. Ma, RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images, in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2010
V. Pitsikalis, S. Theodorakis, P. Maragos, Data-driven sub-units and modeling structure for continuous sign language recognition with multiple cues, in LREC Workshop Repr. & Proc. SL, Corpora and SL Technologies, 2010
L.R. Rabiner, R.W. Schafer, Introduction to digital speech processing. Found. Trends Signal Process. 1(1–2), 1–194 (2007)
A. Roussos, S. Theodorakis, V. Pitsikalis, P. Maragos, Affine-invariant modeling of shape-appearance images applied on sign language handshape classification, in Proceedings of the International Conference on Image Processing, 2010a
A. Roussos, S. Theodorakis, V. Pitsikalis, P. Maragos, Hand tracking and affine shape-appearance handshape sub-units in continuous sign language recognition, in Proceedings of the ECCV Workshop on Sign, Gesture and Activity, 2010b
J. Sherrah, S. Gong, Resolving visual uncertainty and occlusion through probabilistic reasoning, in Proceedings of the British Machine Vision Conference, 2000, pp. 252–261
P. Soille, Morphological Image Analysis: Principles and Applications (Springer, 2004)
T. Starner, J. Weaver, A. Pentland, Real-time american sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)
B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Model-based hand tracking using a hierarchical bayesian filter. IEEE Trans. Pattern Anal. Mach. Intell. 28(9), 1372–1384 (2006)
G.J. Sweeney, A.C. Downton, Towards appearance-based multi-channel gesture recognition, in Proceedings of the International Gesture Workshop, 1996, pp. 7–16
N. Tanibata, N. Shimada, Y. Shirai, Extraction of hand features for recognition of sign language words, in Proceedings of the International Conference on Vision Interface, 2002, pp. 391–398
J. Terrillon, M. Shirazi, H. Fukamachi, S. Akamatsu, Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2000, pp. 54–61
A. Thangali, J.P. Nash, S. Sclaroff, C. Neidle, Exploiting phonological constraints for handshape inference in asl video, in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2011), pp. 521–528
S. Theodorakis, V. Pitsikalis, P. Maragos, Advances in dynamic-static integration of movement and handshape cues for sign language recognition, in Proceedings of the International Gesture Workshop, 2011
S. Theodorakis, V. Pitsikalis, I. Rodomagoulakis, P. Maragos, Recognition with raw canonical phonetic movement and handshape subunits on videos of continuous sign language, in Proceedings of the International Conference on Image Processing, 2012
M. Viola, M.J. Jones, Fast multi-view face detection, in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2003
C. Vogler, D. Metaxas, Parallel hidden markov models for american sign language recognition. Proc. Int. Conf. Comput. Vis. 1, 116–122 (1999)
Y. Wu, T.S. Huang, View-independent recognition of hand postures. Proc. Conf. Comput. Vis. Pattern Recognit. 2, 88–94 (2000)
M.-H. Yang, N. Ahuja, M. Tabb, Extraction of 2d motion trajectories and its application to hand gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1061–1074 (2002)
S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, P. Woodland, The HTK Book (Entropic Ltd., 1999)
J. Zieren, N. Unger, S. Akyol, Hands tracking from frontal view for vision-based gesture recognition, in Pattern Recognition, LNCS, 2002, pp. 531–539
Acknowledgements
This research work was supported by the EU under the research program Dictasign with Grant FP7-ICT-3-231135. A. Roussos was also supported by the ERC Starting Grant 204871-HUMANIS. This work was done while A. Roussos was with the National Technical University of Athens, Greece and Queen Mary, University of London, UK; S. Theodorakis and V. Pitsikalis were both with the National Technical University of Athens; S. Theodorakis and V. Pitsikalis are now with deeplab.ai, Athens, GR.
Appendix A. Details about the Regularized Fitting Algorithm
We provide here details about the algorithm of the regularized fitting of the shape-appearance model. The total energy \(E(\varvec{\lambda },\varvec{p})\) that is to be minimized can be written as (after a multiplication with \(N_M\) that does not affect the optimum parameters):
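If \(\sigma _{{\lambda }_i}\), \(\sigma _{\tilde{p}_i}\) are the standard deviations of the components of the parameters \(\varvec{\lambda }\), \(\widetilde{\varvec{p}}\) respectively and \(\sigma _{{\varepsilon }_{\varvec{\lambda },i}}\), \(\sigma _{{\varepsilon }_{\widetilde{\varvec{p}},i}}\) are the standard deviations of the components of the parameters’ prediction errors \(\varvec{\varepsilon }_{\varvec{\lambda }}\), \(\varvec{\varepsilon }_{\widetilde{\varvec{p}}}\), then the corresponding covariance matrices \(\Sigma _{\varvec{\lambda }}\), \(\Sigma _{\widetilde{\varvec{p}}}\), \(\Sigma _{\varvec{\varepsilon }_{\varvec{\lambda }}}\), \(\Sigma _{\varvec{\varepsilon }_{\widetilde{\varvec{p}}}}\), which are diagonal, can be written as:
\[\begin{aligned} \Sigma _{\varvec{\lambda }}&= \mathrm {diag}\left( \sigma _{{\lambda }_1}^2,\ldots ,\sigma _{{\lambda }_{N_c}}^2\right) ,&\Sigma _{\widetilde{\varvec{p}}}&= \mathrm {diag}\left( \sigma _{\tilde{p}_1}^2,\ldots ,\sigma _{\tilde{p}_{N_p}}^2\right) , \\ \Sigma _{\varvec{\varepsilon }_{\varvec{\lambda }}}&= \mathrm {diag}\left( \sigma _{{\varepsilon }_{\varvec{\lambda },1}}^2,\ldots ,\sigma _{{\varepsilon }_{\varvec{\lambda },N_c}}^2\right) ,&\Sigma _{\varvec{\varepsilon }_{\widetilde{\varvec{p}}}}&= \mathrm {diag}\left( \sigma _{{\varepsilon }_{\widetilde{\varvec{p}},1}}^2,\ldots ,\sigma _{{\varepsilon }_{\widetilde{\varvec{p}},N_p}}^2\right) . \end{aligned}\]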
The squared norms of the prior terms in Eq. (8.4) are thus given by:
Therefore, if we set:
the energy in Eq. (8.4) takes the form:
with \(G_i(\varvec{\lambda },\varvec{p})\) being \(N_G=2 N_c + 2 N_p\) prior functions defined by:
Each component \(\tilde{p}_j\), \(j=1,\ldots ,N_p\), of the re-parametrization of \(\varvec{p}\) can be written as:
where \(\varvec{v}_{\tilde{p}_j}\) is the j-th column of \(U_{\varvec{p}}\), that is the eigenvector of the covariance matrix \(\Sigma _{\varvec{p}}\) that corresponds to the j-th principal component \(\tilde{p}_j\).
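The re-parametrization described above can be illustrated with a short numerical sketch. It assumes that \(\widetilde{\varvec{p}}\) is obtained by projecting the mean-centered affine parameters onto the eigenvectors of their sample covariance; the centering, the synthetic samples and the function names are assumptions made for illustration and are not taken from the chapter.

```python
import numpy as np

def fit_reparametrization(P):
    """Estimate mean, eigenvectors U_p and variances from samples P (n_samples x N_p)."""
    p_mean = P.mean(axis=0)
    cov = np.cov(P, rowvar=False)            # sample covariance, plays the role of Sigma_p
    eigvals, U = np.linalg.eigh(cov)         # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort principal components by variance
    return p_mean, U[:, order], eigvals[order]

def reparametrize(p, p_mean, U):
    """Component j is the inner product of column j of U with (p - p_mean)."""
    return U.T @ (p - p_mean)

# Synthetic, correlated 6-dimensional "affine parameter" samples (hypothetical data).
rng = np.random.default_rng(0)
P = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))

p_mean, U, variances = fit_reparametrization(P)
P_tilde = (P - p_mean) @ U                   # re-parametrize all samples at once
cov_tilde = np.cov(P_tilde, rowvar=False)    # should be diagonal
```

In this sketch the covariance of the re-parametrized samples is diagonal, which is the property that allows the prior terms to be written with diagonal covariance matrices.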
In fact, the energy \(J(\varvec{\lambda },\varvec{p})\), Eq. (8.5), for general prior functions \(G_i(\varvec{\lambda },\varvec{p})\), has exactly the same form as the energy that is minimized by the algorithm of Baker et al. (2004). Next, we describe this algorithm and then specialize it to our framework.
A.1. Simultaneous Inverse Compositional Algorithm with a Prior
We briefly present here the simultaneous inverse compositional algorithm with a prior (SICP) (Baker et al. 2004). This is a Gauss-Newton algorithm that finds a local minimum of the energy \(J(\varvec{\lambda },\varvec{p})\) (8.5) for general prior functions \(G_i(\varvec{\lambda },\varvec{p})\) and general warps \(W_{\varvec{p}}(\varvec{x})\) that are controlled by some parameters \(\varvec{p}\).
The algorithm starts from some initial estimates of \(\varvec{\lambda }\) and \(\varvec{p}\). Afterwards, in every iteration, the previous estimates of \(\varvec{\lambda }\) and \(\varvec{p}\) are updated to \(\varvec{\lambda }'\) and \(\varvec{p}'\) as follows. It is considered that a vector \(\Delta \varvec{\lambda }\) is added to \(\varvec{\lambda }\):
\[\varvec{\lambda }' = \varvec{\lambda } + \Delta \varvec{\lambda } \qquad (8.8)\]
and a warp with parameters \(\Delta \varvec{p}\) is applied to the synthesized image \(A_0(\varvec{x}) + \sum \lambda _i A_i(\varvec{x})\). As an approximation, the latter is taken as equivalent to updating the warp parameters from \(\varvec{p}\) to \(\varvec{p}'\) by composing \(W_{\varvec{p}}(\varvec{x})\) with the inverse of \(W_{\Delta \varvec{p}}(\varvec{x})\):
\[W_{\varvec{p}'} = W_{\varvec{p}} \circ W_{\Delta \varvec{p}}^{-1} \qquad (8.9)\]
From the above relation, given that \(\varvec{p}\) is constant, \(\varvec{p}'\) can be expressed as a \(\mathbb {R}^{N_p} \rightarrow \mathbb {R}^{N_p}\) function of \(\Delta \varvec{p}\), \(\varvec{p}'=\varvec{p}'(\Delta \varvec{p})\), with \(\varvec{p}'(\Delta \varvec{p}=0) = \varvec{p}\). Further, \(\varvec{p}'(\Delta \varvec{p})\) is approximated with a first order Taylor expansion around \(\Delta \varvec{p} = 0\):
\[\varvec{p}'(\Delta \varvec{p}) \approx \varvec{p} + \frac{\partial \varvec{p}'}{\partial \Delta \varvec{p}}\, \Delta \varvec{p} \qquad (8.10)\]
where \(\frac{\partial \varvec{p}'}{\partial \Delta \varvec{p}}\) is the Jacobian of the function \(\varvec{p}'(\Delta \varvec{p})\), which generally depends on \(\Delta \varvec{p}\).
Based on the aforementioned type of updates of \(\varvec{\lambda }\) and \(\varvec{p}\) as well as the considered approximations, the values \(\Delta \varvec{\lambda }\) and \(\Delta \varvec{p}\) are specified by minimizing the following energy:
simultaneously with respect to \(\Delta \varvec{\lambda }\) and \(\Delta \varvec{p}\). By applying first order Taylor approximations on the two terms of the above energy \(F(\varvec{\lambda },\varvec{p})\), one gets:
where \(E_{sim}(\varvec{x})\) is the image of reconstruction error evaluated at the model domain:
and \(\varvec{SD}_{sim}(\varvec{x})\) is a vector-valued “steepest descent” image with \(N_c+N_{\varvec{p}}\) channels, each one of them corresponding to a specific component of the parameter vectors \(\varvec{\lambda }\) and \(\varvec{p}\):
where the gradients \(\nabla A_i(\varvec{x})=\left[ \frac{\partial A_i}{\partial x_1} \,,\, \frac{\partial A_i}{\partial x_2} \right] \) are considered as row vector functions. Also \(\varvec{SD}_{G_i}\), for each \(i=1,...,N_G\), is a row vector with dimension \(N_c+N_{\varvec{p}}\) that corresponds to the steepest descent direction of the prior term \(G_i(\varvec{\lambda },\varvec{p})\):
The approximated energy \(F(\varvec{\lambda },\varvec{p})\) (8.11) is quadratic with respect to both \(\Delta \varvec{\lambda }\) and \(\Delta \varvec{p}\), therefore the minimization can be done analytically and leads to the following solution:
where H is the matrix (which approximates the Hessian of F):
In conclusion, in every iteration of the SICP algorithm, Eq. (8.14) is applied and the parameters \(\varvec{\lambda }\) and \(\varvec{p}\) are updated using Eqs. (8.8) and (8.10). This process terminates when a norm of the update vector \( \left( \begin{array}{c} \Delta \varvec{\lambda } \\ \Delta \varvec{p} \\ \end{array}\right) \) falls below a relatively small threshold, at which point the process is considered to have converged.
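As a hedged illustration of the analytic minimization step, the sketch below solves a generic Gauss-Newton problem in which the linearized energy is a sum of squared image residuals plus a sum of squared prior terms. The arrays `sd`, `err`, `sd_g` and `g` are hypothetical stand-ins for the steepest-descent images, the error image, the prior steepest-descent rows and the prior values; they are filled with random numbers here and are not computed from an actual shape-appearance model.

```python
import numpy as np

def gauss_newton_update(sd, err, sd_g, g):
    """Minimize ||err + sd @ d||^2 + ||g + sd_g @ d||^2 over the update vector d.

    sd   : (N_M, N_c + N_p) steepest-descent values, one row per model pixel
    err  : (N_M,)           reconstruction error at each model pixel
    sd_g : (N_G, N_c + N_p) steepest-descent rows of the prior terms
    g    : (N_G,)           current values of the prior functions
    """
    H = sd.T @ sd + sd_g.T @ sd_g            # Gauss-Newton approximation of the Hessian
    rhs = sd.T @ err + sd_g.T @ g            # data and prior contributions to the gradient
    d = -np.linalg.solve(H, rhs)             # stacked update (Delta lambda, Delta p)
    return d, H

# Toy dimensions (hypothetical): 50 pixels, 5 parameters, 4 prior terms.
rng = np.random.default_rng(1)
sd, err = rng.normal(size=(50, 5)), rng.normal(size=50)
sd_g, g = rng.normal(size=(4, 5)), rng.normal(size=4)
d, H = gauss_newton_update(sd, err, sd_g, g)
```

The same update can be obtained by stacking the data and prior rows into a single linear least-squares system, which is a convenient way to check an implementation.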
A.1.1. Combination with Levenberg-Marquardt Algorithm
In the algorithm described above, there is no guarantee that the original energy (8.5), that is, the objective function before any approximation, decreases in every iteration; it might increase if the approximations involved are not accurate. Therefore, following Baker and Matthews (2002), we modify this algorithm by combining it with the Levenberg-Marquardt algorithm: in Eq. (8.14), which specifies the updates, we replace the Hessian approximation H by \(H+\delta \, \mathrm {diag}(H)\), where \(\delta \) is a positive weight and \(\mathrm {diag}(H)\) is the diagonal matrix that contains the diagonal elements of H. This corresponds to an interpolation between the updates given by the Gauss-Newton algorithm and weighted gradient descent. As \(\delta \) increases, the algorithm behaves more like gradient descent: on the one hand it is slower, but on the other hand it yields updates that are more reliable, in the sense that the energy will eventually decrease for sufficiently large \(\delta \).
In every iteration, we specify the appropriate weight \(\delta \) as follows. Starting from setting \(\delta \) to 1/10 of its value in the previous iteration (or from \(\delta =0.01\) if this is the first iteration), we compute the updates \(\Delta \varvec{\lambda }\) and \(\Delta \varvec{p}\) using the Hessian approximation \(H+\delta \, \mathrm {diag}(H)\) and then evaluate the original energy (8.5). If the energy has decreased, we keep the updates and finish the iteration. If the energy has increased, we set \(\delta \rightarrow 10\,\delta \) and try again. We repeat this step until the energy decreases.
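The damping schedule described above can be sketched on a toy problem. The example below applies it to a two-parameter least-squares energy (the Rosenbrock function written as a sum of two squared residuals) with plain additive updates; the toy energy, the additive update and the function names are assumptions chosen for illustration, whereas the actual fitting uses the compositional warp update of the SICP algorithm.

```python
import numpy as np

def lm_iteration(energy, residual, jacobian, x, delta_prev=None, max_tries=50):
    """One damped Gauss-Newton iteration using H + delta * diag(H)."""
    # Start from 1/10 of the previous weight, or from 0.01 in the first iteration.
    delta = 0.01 if delta_prev is None else delta_prev / 10.0
    e_old = energy(x)
    r, J = residual(x), jacobian(x)
    g, H = J.T @ r, J.T @ J                  # gradient (up to a factor 2) and Hessian approximation
    for _ in range(max_tries):
        step = -np.linalg.solve(H + delta * np.diag(np.diag(H)), g)
        if energy(x + step) < e_old:         # keep the update only if the true energy decreased
            return x + step, delta, step
        delta *= 10.0                        # otherwise increase the damping and try again
    return x, delta, np.zeros_like(x)        # no decrease found: report a zero update

# Toy energy: sum of the squares of two residuals (Rosenbrock function).
def residual(x):
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jacobian(x):
    return np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])

def energy(x):
    return float(np.sum(residual(x) ** 2))

x, delta = np.array([-1.2, 1.0]), None
energies = [energy(x)]
for _ in range(500):
    x, delta, step = lm_iteration(energy, residual, jacobian, x, delta)
    energies.append(energy(x))
    if np.linalg.norm(step) < 1e-10:         # terminate when the update norm is small
        break
```

By construction the recorded energies never increase, which is the property that motivates the combination with the Levenberg-Marquardt algorithm.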
A.2. Specialization in the Current Framework
In this section, we derive the SICP algorithm for the special case that concerns our method. This case arises when (1) the general warps \(W_{\varvec{p}}(\varvec{x})\) are specialized to affine transforms and (2) the general prior functions \(G_i(\varvec{\lambda },\varvec{p})\) are given by Eq. (8.6).
A.2.1. The Case of Affine Transforms
In our framework, the general warps \(W_{\varvec{p}}(\varvec{x})\) of the SICP algorithm are specialized to affine transforms with parameters \(\varvec{p}=(p_1\cdots p_6)\) that are defined by:
In this special case, which is analyzed also in Baker et al. (2004), the Jacobian \(\frac{\partial W_p(\varvec{x})}{\partial \varvec{p}}\) that is used in Eq. (8.12) is given by:
The restriction to affine transforms implies also a special form for the Jacobian \(\frac{\partial \varvec{p}'}{\partial \Delta \varvec{p}}\) that is used in Eq. (8.13). More precisely, as described in Baker et al. (2004), a first order Taylor approximation is first applied to the inverse warp \(W_{\Delta \varvec{p}}^{-1}\) and yields \(W_{\Delta \varvec{p}}^{-1} \approx W_{-\Delta \varvec{p}}\). Afterwards, based on Eq. (8.9) and the fact that the parameters of a composition \(W_{\varvec{r}} = W_{\varvec{p}} \circ W_{\varvec{q}}\) of two affine transforms are given by:
the function \(\varvec{p}'(\Delta \varvec{p})\) (8.10) is approximated as:
Therefore, its Jacobian is given by:
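The affine case can be checked numerically with the following sketch. It assumes the affine parametrization commonly used by Baker and Matthews, in which \(W_{\varvec{p}}\) maps \((x_1, x_2)\) to \(((1+p_1)x_1 + p_3 x_2 + p_5,\; p_2 x_1 + (1+p_4)x_2 + p_6)\); this ordering of the parameters, like all function names below, is an assumption and may differ from the convention used in the chapter.

```python
import numpy as np

def affine_matrix(p):
    """3x3 homogeneous matrix of W_p under the assumed parametrization."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1 + p1, p3, p5], [p2, 1 + p4, p6], [0.0, 0.0, 1.0]])

def warp(p, x):
    """Apply W_p to the point x = (x1, x2)."""
    return (affine_matrix(p) @ np.array([x[0], x[1], 1.0]))[:2]

def warp_jacobian(x):
    """2x6 Jacobian of W_p(x) with respect to p (independent of p for affine warps)."""
    x1, x2 = x
    return np.array([[x1, 0, x2, 0, 1, 0], [0, x1, 0, x2, 0, 1.0]])

def compose(p, q):
    """Parameters r of the composition W_r = W_p o W_q, in closed form."""
    p1, p2, p3, p4, p5, p6 = p
    q1, q2, q3, q4, q5, q6 = q
    return np.array([p1 + q1 + p1 * q1 + p3 * q2,
                     p2 + q2 + p2 * q1 + p4 * q2,
                     p3 + q3 + p1 * q3 + p3 * q4,
                     p4 + q4 + p2 * q3 + p4 * q4,
                     p5 + q5 + p1 * q5 + p3 * q6,
                     p6 + q6 + p2 * q5 + p4 * q6])

def update_jacobian(p):
    """Jacobian of p'(dp) = compose(p, -dp), i.e. with the inverse warp approximated by W_{-dp}."""
    A = np.array([[1 + p[0], p[2]], [p[1], 1 + p[3]]])
    return -np.kron(np.eye(3), A)            # block-diagonal: three copies of -A

rng = np.random.default_rng(2)
p, q, dp = 0.3 * rng.normal(size=6), 0.3 * rng.normal(size=6), 0.05 * rng.normal(size=6)
x = np.array([4.0, -2.5])
p_new = compose(p, -dp)                      # approximate inverse compositional update of p
```

Under this approximation the updated parameters are exactly linear in \(\Delta \varvec{p}\), so `p + update_jacobian(p) @ dp` reproduces `p_new` without any further truncation error.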
A.2.2. Specific Type of Prior Functions
Apart from the restriction to affine transforms, in the proposed framework of the regularized shape-appearance model fitting, we have derived the specific formulas of Eq. (8.6) for the prior functions \(G_i(\varvec{\lambda },\varvec{p})\) of the energy \(J(\varvec{\lambda },\varvec{p})\) in Eq. (8.5). Therefore, in our case, their partial derivatives, which are involved in the above described SICP algorithm (see Eq. (8.13)), are specialized as follows:
where \(\varvec{e}_i\), \(1 \le i \le N_c\), is the i-th column of the \(N_c \times N_c\) identity matrix.
© 2017 Springer International Publishing AG
Cite this chapter
Roussos, A., Theodorakis, S., Pitsikalis, V., Maragos, P. (2017). Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos. In: Escalera, S., Guyon, I., Athitsos, V. (eds) Gesture Recognition. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-57021-1_8
DOI: https://doi.org/10.1007/978-3-319-57021-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57020-4
Online ISBN: 978-3-319-57021-1
eBook Packages: Computer Science, Computer Science (R0)