Speaker Dependent Visual Speech Recognition by Symbol and Real Value Assignment

  • Chapter
Robot Intelligence Technology and Applications 2012

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 208)

Abstract

In this paper, we propose a visual speech recognition method based on symbol or real-value assignment. Our method is inspired by the Bag of Words (BoW) model [1], which is usually applied to object matching. In the BoW model, a codebook is produced by K-means clustering, and each feature vector extracted from an image is converted to its corresponding symbol. Similarly, we generate a codebook by running the K-means algorithm on a pool of PHOG (Pyramid Histogram of Oriented Gradients) feature vectors extracted from a subset of a lip image database. Each remaining lip image is then assigned a value by comparing its chi-square distance to each cluster. Depending on the type of this value, we propose two methods for assigning a value to a lip image frame. The first method finds the cluster containing the member image with the minimum chi-square distance to the frame being processed and assigns that cluster's label to the frame. The second method computes the distances between the frame and all cluster centroids; the resulting multi-dimensional vector directly becomes the value assigned to the frame. With these methods, each time sequence is converted into either a symbol sequence or a multi-dimensional real-valued sequence. To measure the similarity between two time sequences, we use Dynamic Time Warping (DTW) for real-valued sequences and the edit distance for symbol sequences.
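The pipeline in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation, and several details are assumptions: PHOG extraction is omitted (frames are given as precomputed non-negative histograms), the chi-square distance uses the common 0.5 · Σ (a − b)² / (a + b) variant, K-means clusters with Euclidean distance (the abstract does not name the clustering metric), DTW uses a Euclidean local cost, and the edit distance is plain Levenshtein with unit costs. All function names and the toy data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def chi_square(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Chi-square distance between two non-negative histograms (0.5 * sum variant, assumed)."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def kmeans(data: List[Vector], k: int, iters: int = 20, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Plain Lloyd's K-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: each vector joins its Euclidean-nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in data]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def assign_symbol(frame: Vector, pool: List[Vector], labels: List[int]) -> int:
    """Method 1: label of the cluster whose member image is chi-square-nearest to the frame."""
    nearest = min(range(len(pool)), key=lambda i: chi_square(frame, pool[i]))
    return labels[nearest]


def assign_real(frame: Vector, centroids: List[Vector]) -> Vector:
    """Method 2: vector of chi-square distances from the frame to every centroid."""
    return [chi_square(frame, c) for c in centroids]


def dtw(s: List[Vector], t: List[Vector]) -> float:
    """Dynamic Time Warping with a Euclidean local cost between multi-dimensional frames."""
    n, m = len(s), len(t)
    d = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(s[i - 1], t[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance (unit-cost insert/delete/substitute) between symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Made-up 4-bin "histograms" standing in for PHOG descriptors of codebook images.
    pool = [[4, 0, 0, 0], [3, 1, 0, 0], [0, 0, 4, 0], [0, 0, 3, 1]]
    centroids, labels = kmeans(pool, k=2)
    clip_a = [[4, 0, 0, 0], [0, 0, 4, 0]]
    clip_b = [[3, 1, 0, 0], [0, 0, 3, 1], [0, 0, 4, 0]]
    # Method 1 -> symbol sequences compared with edit distance.
    sym_a = [assign_symbol(f, pool, labels) for f in clip_a]
    sym_b = [assign_symbol(f, pool, labels) for f in clip_b]
    # Method 2 -> real-valued sequences compared with DTW.
    real_a = [assign_real(f, centroids) for f in clip_a]
    real_b = [assign_real(f, centroids) for f in clip_b]
    print(edit_distance(sym_a, sym_b), round(dtw(real_a, real_b), 3))
```

In this sketch, the first method compares each frame against every pooled member image, so its cost grows with the pool size, whereas the second method needs only k comparisons per frame (one per centroid).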

References

  1. Sivic, J., Zisserman, A.: Video Google: A Text Retrieval Approach to Object Matching in Videos. In: Proc. Ninth Int’l Conf. Computer Vision, pp. 1470–1478 (2003)

  2. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2, 141–151 (2000)

  3. Potamianos, G., Neti, C., Luettin, J., Matthews, I.: Audio-Visual Automatic Speech Recognition: An Overview. In: Bailly, G., Vatikiotis-Bateson, E., Perrier, P. (eds.) Issues in Visual and Audio-Visual Speech Processing. MIT Press (2004)

  4. Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., Mashari, A., Zhou, J.: Audio-visual speech recognition. In: Final Workshop 2000 Report, vol. 764 (2000)

  5. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7), 1254–1265 (2009)

  6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893. IEEE (2005)

  7. Bai, Y., Guo, L., Jin, L., Huang, Q.: A novel feature extraction method using pyramid histogram of orientation gradients for smile recognition. In: 2009 16th IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 3305–3308. IEEE (2009)

  8. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John Wiley, New York (2001)

  9. Ten Holt, G.A., Reinders, M.J.T., Hendriks, E.A.: Multi-dimensional dynamic time warping for gesture recognition. In: Proc. of the Conference of the Advanced School for Computing and Imaging, ASCI 2007 (2007)

  10. Senin, P.: Dynamic time warping algorithm review. Information and Computer Science Department, University of Hawaii at Manoa, Honolulu, USA (2008)

  11. Neuhaus, M., Bunke, H.: Edit distance based kernel functions for structural pattern classification. Pattern Recognition 39(10), 1852–1863 (2006)

  12. Bahlmann, C., Haasdonk, B., Burkhardt, H.: On-line handwriting recognition with support vector machines—a kernel approach. In: Proc. 8th Int. Workshop Front. Handwriting Recognition (IWFHR), pp. 49–54 (2002)

Author information

Correspondence to Jeongwoo Ju.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ju, J., Jung, H., Kim, J. (2013). Speaker Dependent Visual Speech Recognition by Symbol and Real Value Assignment. In: Kim, JH., Matson, E., Myung, H., Xu, P. (eds) Robot Intelligence Technology and Applications 2012. Advances in Intelligent Systems and Computing, vol 208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37374-9_98

  • DOI: https://doi.org/10.1007/978-3-642-37374-9_98

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37373-2

  • Online ISBN: 978-3-642-37374-9

  • eBook Packages: Engineering, Engineering (R0)
