
Totalrecall: visualization and semi-automatic annotation of very large audio-visual corpora

Published: 12 November 2007

Abstract

We introduce a system for visualizing, annotating, and analyzing very large collections of longitudinal audio and video recordings. The system, TotalRecall, is designed to address the requirements of projects like the Human Speechome Project, for which more than 100,000 hours of multitrack audio and video have been collected over a twenty-two-month period. Our goal in this project is to transcribe speech in over 10,000 hours of audio recordings, and to annotate the position and head orientation of multiple people in the 10,000 hours of corresponding video. Higher-level behavioral analysis of the corpus will be based on these and other annotations. To cope efficiently with this huge corpus, we are developing semi-automatic data coding methods that are integrated into TotalRecall. Ultimately, this system and the underlying methodology may enable new forms of multimodal behavioral analysis grounded in ultra-dense longitudinal data.
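The abstract does not detail the semi-automatic coding methods themselves. As illustration only, one common building block for this kind of pipeline is energy-based speech-activity detection, used to pre-segment audio so human transcribers only listen to candidate speech regions. The sketch below is a hypothetical minimal version; the function name, frame size, and dB threshold are our assumptions, not anything described in the paper:

```python
import numpy as np

def detect_speech_segments(samples, rate, frame_ms=30, threshold_db=-35.0):
    """Flag frames whose short-time RMS energy exceeds a dB threshold,
    then merge runs of active frames into (start_s, end_s) segments."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square energy per frame, in dB relative to full scale.
    rms = np.sqrt(np.mean(frames.astype(float) ** 2, axis=1)) + 1e-12
    db = 20.0 * np.log10(rms)
    active = db > threshold_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                       # speech run begins
        elif not a and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None                    # speech run ends
    if start is not None:                   # signal ended while active
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```

A detector like this only proposes segment boundaries; in a semi-automatic workflow the annotator then corrects boundaries and supplies the transcript, which is far faster than scanning raw audio.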




    Published In

    ICMI '07: Proceedings of the 9th international conference on Multimodal interfaces
    November 2007
    402 pages
    ISBN:9781595938176
    DOI:10.1145/1322192

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. multimedia corpora
    2. semi-automation
    3. speech transcription
    4. video annotation
    5. visualization

    Qualifiers

    • Poster

    Conference

    ICMI '07: International Conference on Multimodal Interfaces
    November 12-15, 2007
    Nagoya, Aichi, Japan

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
