Abstract
In usability studies involving eye-tracking, quantitative analysis of gaze data requires information about so-called scene occurrences. Scene occurrences are time segments during which the application user interface remains largely static, so gaze events (e.g., fixations) can be mapped to particular areas of interest (user interface elements). Scene occurrences typically start and end with user interface changes such as page-to-page transitions, menu expansions, overlay prompts, etc. Normally, one would record such changes programmatically through application logging, yet in many studies this is not possible. For example, in early-prototype mobile-app testing, only a camera recording of a smart device screen is often available as evidence. In such cases, analysts must manually annotate the recordings. To reduce the need for manual annotation of scene occurrences, we present an image processing method for segmenting user interface video recordings. The method exploits specific properties of user interface recordings, which differ greatly from real-world video shots (for which many segmentation methods exist). The core of our method lies in applying the SSIM and SIFT similarity metrics to video frames (with several pre-processing and filtering procedures). The main advantage of our method is that it requires no training data apart from a single screenshot example for each scene (to which the recording frames are compared). The method is also able to work with user finger overlays, which are always present in mobile device recordings. We evaluate the accuracy of our method over recordings from several real-life studies and compare it with other image similarity techniques.
References
Banovic N, Grossman T, Matejka J, Fitzmaurice G (2012) Waken: reverse engineering usage information and interface structure from software videos. In: Proceedings of the 25th annual ACM symposium on user interface software and technology, UIST ’12. ACM, New York, pp 83–92. https://doi.org/10.1145/2380116.2380129
Bao L, Li J, Xing Z, Wang X, Zhou B (2015) scvRipper: video scraping tool for modeling developers’ behavior using interaction data. In: Proceedings of the 37th international conference on software engineering - volume 2, ICSE ’15. IEEE Press, Piscataway, pp 673–676. http://dl.acm.org/citation.cfm?id=2819009.2819134
Chang TH, Yeh T, Miller RC (2010) GUI testing using computer vision. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10. ACM, New York, pp 1535–1544. https://doi.org/10.1145/1753326.1753555
Ciresan DC, Meier U, Masci J, Maria Gambardella L, Schmidhuber J (2011) Flexible, high performance convolutional neural networks for image classification. In: IJCAI Proceedings-international joint conference on artificial intelligence, vol 22, Barcelona, p 1237
Denoue L, Carter S, Cooper M (2016) Docugram: turning screen recordings into documents. In: Proceedings of the 2016 ACM symposium on document engineering, DocEng ’16. ACM, New York, pp 185–188. https://doi.org/10.1145/2960811.2967154
Dixon M, Fogarty J (2010) Prefab: implementing advanced behaviors using pixel-based reverse engineering of interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10. ACM, New York, pp 1525–1534. https://doi.org/10.1145/1753326.1753554
Dixon M, Laput G, Fogarty J (2014) Pixel-based methods for widget state and style in a runtime implementation of sliding widgets. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’14. ACM, New York, pp 2231–2240. https://doi.org/10.1145/2556288.2556979
Duchowski AT (2007) Eye tracking methodology: theory and practice. Springer-Verlag New York, Inc., Secaucus
Givens P, Chakarov A, Sankaranarayanan S, Yeh T (2013) Exploring the internal state of user interfaces by combining computer vision techniques with grammatical inference. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Press, Piscataway, pp 1165–1168. http://dl.acm.org/citation.cfm?id=2486788.2486951
Haralick RM, Sternberg SR, Zhuang X (1987) Image analysis using mathematical morphology. IEEE Trans Pattern Anal Mach Intell 4:532–550
Holmqvist K, Nyström M, Andersson R, Dewhurst R, Jarodzka H, van de Weijer J (2011) Eye tracking: a comprehensive guide to methods and measures. OUP Oxford. https://books.google.sk/books?id=5rIDPV1EoLUC
Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML Deep learning workshop, vol 2
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Mendi E, Bayrak C (2010) Shot boundary detection and key frame extraction using salient region detection and structural similarity. In: Proceedings of the 48th annual southeast regional conference, ACM SE ’10. ACM, New York, pp 66:1–66:4. https://doi.org/10.1145/1900008.1900096
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66
Pongnumkul S, Dontcheva M, Li W, Wang J, Bourdev L, Avidan S, Cohen MF (2011) Pause-and-play: automatically linking screencast video tutorials with applications. In: Proceedings of the 24th annual ACM symposium on user interface software and technology, UIST ’11. ACM, New York, pp 135–144. https://doi.org/10.1145/2047196.2047213
Priya GGL, Domnic S (2010) Video cut detection using dominant color features. In: Proceedings of the first international conference on intelligent interactive technologies and multimedia, IITM ’10. ACM, New York, pp 130–134. https://doi.org/10.1145/1963564.1963586
Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: 2014 IEEE Conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 512–519
Tahaghoghi SMM, Williams HE, Thom JA, Volkmer T (2005) Video cut detection using frame windows. In: Proceedings of the twenty-eighth Australasian conference on computer science - volume 38, ACSC ’05. Australian Computer Society, Inc., Darlinghurst, pp 193–199. http://dl.acm.org/citation.cfm?id=1082161.1082183
Tao D, Cheng J, Song M, Lin X (2016) Manifold ranking-based matrix factorization for saliency detection. IEEE Trans Neural Netw Learn Syst 27(6):1122–1134
Tonomura Y, Akutsu A, Otsuji K, Sadakata T (1993) VideoMAP and VideoSpaceIcon: tools for anatomizing video content. In: Proceedings of the INTERACT ’93 and CHI ’93 conference on human factors in computing systems, CHI ’93. ACM, New York, pp 131–136. https://doi.org/10.1145/169059.169117
Truong BT, Dorai C, Venkatesh S (2000) New enhancements to cut, fade, and dissolve detection processes in video segmentation. In: Proceedings of the eighth ACM international conference on multimedia, MULTIMEDIA ’00. ACM, New York, pp 219–227. https://doi.org/10.1145/354384.354481
Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. (2016) Matching networks for one shot learning. In: Advances in neural information processing systems, pp 3630–3638
Wang R, Tao D (2016) Non-local auto-encoder with collaborative stabilization for image restoration. IEEE Trans Image Process 25(5):2117–2129. https://doi.org/10.1109/TIP.2016.2541318
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612. https://doi.org/10.1109/TIP.2003.819861
Yang X, Liu W, Tao D, Cheng J (2017) Canonical correlation analysis networks for two-view image recognition. Inf Sci 385(C):338–352. https://doi.org/10.1016/j.ins.2017.01.011
Yeh T, Chang TH, Miller RC (2009) Sikuli: using GUI screenshots for search and automation. In: Proceedings of the 22nd annual ACM symposium on user interface software and technology, UIST ’09. ACM, New York, pp 183–192. https://doi.org/10.1145/1622176.1622213
Acknowledgements
This work was partially supported by the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0646/15, the Slovak Research and Development Agency under the contract No. APVV-15-0508 and was created with the support of the Ministry of Education, Science, Research and Sport of the Slovak Republic within the Research and Development Operational Programme for the project “University Science Park of STU Bratislava”, ITMS 26240220084, co-funded by the ERDF.
Cite this article
Simko, J., Vrba, J. Screen recording segmentation to scenes for eye-tracking analysis. Multimed Tools Appl 78, 2401–2425 (2019). https://doi.org/10.1007/s11042-018-6369-7