
A real-time system for online learning-based visual transcription of piano music

Published in: Multimedia Tools and Applications

Abstract

To address the challenges of acoustic-based music information retrieval tasks such as automatic music transcription, video of the musical performance can be exploited. This paper presents a new real-time, learning-based system that visually transcribes piano music by classifying the pressed black and white keys with a CNN-SVM pipeline. The entire process relies on visual analysis of the piano keyboard and of the pianist's hands and fingers. The system achieves high accuracy, with an average F1 score of 0.95, even under non-ideal camera views, hand coverage, and lighting conditions, and transcribes music in real time with low latency (about 20 ms). In addition, a new dataset for visual transcription of piano music is created and made available to researchers in this area. Since not all possible varying patterns of the data are available in advance, an online learning approach is applied to efficiently update the original model as new data are added to the training dataset.
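As a rough illustration of the two ideas in the abstract (classifying key-press features with an SVM, then updating the model online as new labeled data arrive), the pipeline could be sketched as below. This is not the authors' implementation: the CNN feature extractor is simulated with synthetic 64-D vectors, and a linear SVM trained by SGD (hinge loss) stands in for the paper's classifier so that incremental updates via `partial_fit` are possible. All names and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def fake_cnn_features(n, pressed):
    """Stand-in for CNN features of a candidate key region.

    In the paper a CNN extracts features from images of the keyboard;
    here we simulate separable 64-D feature vectors for illustration.
    """
    center = 1.0 if pressed else -1.0
    return center + rng.normal(scale=0.5, size=(n, 64))

# Initial training set: features for pressed (1) and unpressed (0) keys.
X_train = np.vstack([fake_cnn_features(200, True), fake_cnn_features(200, False)])
y_train = np.array([1] * 200 + [0] * 200)

# Linear SVM trained with SGD (hinge loss), chosen because it supports
# incremental (online) updates through partial_fit.
clf = SGDClassifier(loss="hinge", random_state=0)
clf.partial_fit(X_train, y_train, classes=[0, 1])

# Online learning step: refine the existing model with newly labeled
# samples instead of retraining from scratch on the full dataset.
X_new = np.vstack([fake_cnn_features(20, True), fake_cnn_features(20, False)])
y_new = np.array([1] * 20 + [0] * 20)
clf.partial_fit(X_new, y_new)

# Evaluate on held-out synthetic samples.
X_test = np.vstack([fake_cnn_features(50, True), fake_cnn_features(50, False)])
y_test = np.array([1] * 50 + [0] * 50)
acc = clf.score(X_test, y_test)
print(f"accuracy after online update: {acc:.2f}")
```

In a real system the `fake_cnn_features` call would be replaced by a forward pass of the trained CNN over each detected key region, and `partial_fit` would be invoked whenever new annotated frames are added to the training dataset.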


[Figs. 1–9 omitted]


Notes

  1. All videos can be downloaded from http://www.sfu.ca/akbari/MTA/Dataset.

  2. The videos can be downloaded from http://www.sfu.ca/akbari/MTA/OnlineLearningExperiments.

  3. The test videos and the classification results can be downloaded from http://www.sfu.ca/akbari/MTA.


Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grants RGPIN312262, STPGP447223, RGPAS478109, and RGPIN288300.

Author information

Correspondence to Mohammad Akbari.


About this article


Cite this article

Akbari, M., Liang, J. & Cheng, H. A real-time system for online learning-based visual transcription of piano music. Multimed Tools Appl 77, 25513–25535 (2018). https://doi.org/10.1007/s11042-018-5803-1

