Merge-Weighted Dynamic Time Warping for Speech Recognition

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult because of the time, storage, and cost involved. When personal privacy is also taken into account, language-independent (LI), lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient way to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by constraints, and inaccuracy. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
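
For orientation, the core DTW process that the abstract refers to can be sketched as below. This is a minimal illustration of classic, unconstrained DTW used for nearest-template word recognition. It is not the MWDTW algorithm, whose merging step and template confidence index are not specified in this abstract. The function names `dtw_distance` and `recognize`, the Euclidean local distance, and the toy one-dimensional "feature frames" are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Classic DTW distance between two sequences of feature frames.

    Each frame is a tuple of floats; the Euclidean local distance is an
    assumption made for this sketch.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(templates, query):
    """Return the label of the stored template closest to the query."""
    return min(templates, key=lambda label: dtw_distance(templates[label], query))


# Hypothetical toy data: one template per word, 1-D frames.
templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (5.0,)],
}
query = [(0.0,), (0.0,), (1.0,), (2.0,)]  # a time-stretched "yes"
best = recognize(templates, query)
```

The nested loops fill an (n+1) by (m+1) table, which is the quadratic cost that the abstract lists among the limitations of traditional DTW.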

Author information

Corresponding authors

Correspondence to Zhi-Gang Luo or Ming Li.

Additional information

This work was supported by the Research Plan Project of National University of Defense Technology under Grant No. JC13-06-01, and the OCRit Project made possible by the Global Leadership Round in Genomics & Life Sciences Grant (GL2).

About this article

Cite this article

Zhang, XL., Luo, ZG. & Li, M. Merge-Weighted Dynamic Time Warping for Speech Recognition. J. Comput. Sci. Technol. 29, 1072–1082 (2014). https://doi.org/10.1007/s11390-014-1491-0

  • DOI: https://doi.org/10.1007/s11390-014-1491-0
