
Totalrecall: visualization and semi-automatic annotation of very large audio-visual corpora

Published: 12 November 2007

Abstract

We introduce a system for visualizing, annotating, and analyzing very large collections of longitudinal audio and video recordings. The system, TotalRecall, is designed to address the requirements of projects like the Human Speechome Project, for which more than 100,000 hours of multitrack audio and video have been collected over a twenty-two-month period. Our goal in this project is to transcribe speech in over 10,000 hours of audio recordings, and to annotate the position and head orientation of multiple people in the 10,000 hours of corresponding video. Higher-level behavioral analysis of the corpus will be based on these and other annotations. To cope efficiently with this huge corpus, we are developing semi-automatic data coding methods that are integrated into TotalRecall. Ultimately, this system and the underlying methodology may enable new forms of multimodal behavioral analysis grounded in ultra-dense longitudinal data.
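The abstract does not detail the semi-automatic coding methods themselves. As illustration only, one common building block for this kind of pipeline is energy-based speech-activity detection, used to pre-segment audio so human transcribers only listen to candidate speech regions. The sketch below is a hypothetical minimal version; the function name, frame size, and dB threshold are our assumptions, not anything described in the paper:

```python
import numpy as np

def detect_speech_segments(samples, rate, frame_ms=30, threshold_db=-35.0):
    """Flag frames whose short-time RMS energy exceeds a dB threshold,
    then merge runs of active frames into (start_s, end_s) segments."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square energy per frame, in dB relative to full scale.
    rms = np.sqrt(np.mean(frames.astype(float) ** 2, axis=1)) + 1e-12
    db = 20.0 * np.log10(rms)
    active = db > threshold_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                       # speech run begins
        elif not a and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None                    # speech run ends
    if start is not None:                   # signal ended while active
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```

A detector like this only proposes segment boundaries; in a semi-automatic workflow the annotator then corrects boundaries and supplies the transcript, which is far faster than scanning raw audio.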




    Published In

    ICMI '07: Proceedings of the 9th international conference on Multimodal interfaces
    November 2007
    402 pages
    ISBN:9781595938176
    DOI:10.1145/1322192

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. multimedia corpora
    2. semi-automation
    3. speech transcription
    4. video annotation
    5. visualization

    Qualifiers

    • Poster

    Conference

    ICMI '07: International Conference on Multimodal Interfaces
    November 12-15, 2007
    Nagoya, Aichi, Japan

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
