DOI: 10.1145/2791405.2791485
research-article

Content Based Audiobooks Indexing using Apache Hadoop Framework

Published: 10 August 2015

ABSTRACT

In recent years, content-based audio indexing has become a key research area, because the audio itself describes the content more precisely and has comparatively subservient density. In this paper, we present the conversion of audiobooks into text using the CMU Sphinx-4 speech transcriber, and the efficient indexing of audiobooks using term frequency-inverse document frequency (tf-idf) weights on the Apache Hadoop MapReduce framework. In the first phase, audiobook datasets are converted into text by training the CMU Sphinx-4 speech recognizer with acoustic models. In the next phase, the keywords in the text file generated by the speech recognizer are filtered using tf-idf weights. Finally, we index the audio files by the keywords extracted from the transcribed text. Because speech-to-text conversion and audio indexing are space- and time-intensive tasks, we ported these algorithms to the Hadoop MapReduce framework. Porting content-based indexing of audiobooks to a distributed Hadoop framework resulted in a considerable improvement in time and space utilization. As the amount of data being uploaded and downloaded keeps escalating, this approach can be extended to indexing images, video, and other multimedia forms.
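
The keyword-filtering and indexing phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not state which tf-idf variant or job layout the paper uses, so the sketch assumes the common weight tf × log(N / df), imitates the map and reduce steps with plain Python functions rather than Hadoop jobs, and assumes the transcripts are already available as strings. All function names, the `top_k` parameter, and the sample book identifiers are hypothetical.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step: emit ((doc_id, term), 1) for every word in a transcript."""
    for term in text.lower().split():
        yield (doc_id, term), 1


def reduce_sum(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf_weights(transcripts):
    """Return {doc_id: {term: weight}}, assuming weight = tf * log(N / df)."""
    emitted = [pair
               for doc_id, text in transcripts.items()
               for pair in map_term_counts(doc_id, text)]
    term_counts = reduce_sum(emitted)  # (doc_id, term) -> raw count
    # Total number of words per transcript, used to normalise tf.
    doc_lengths = reduce_sum(
        (doc_id, count) for (doc_id, _), count in term_counts.items())
    # Number of transcripts in which each term appears.
    doc_freq = reduce_sum((term, 1) for (_, term) in term_counts)
    n_docs = len(transcripts)

    weights = defaultdict(dict)
    for (doc_id, term), count in term_counts.items():
        tf = count / doc_lengths[doc_id]
        idf = math.log(n_docs / doc_freq[term])
        weights[doc_id][term] = tf * idf
    return dict(weights)


def build_index(transcripts, top_k=2):
    """Keep the top_k highest-weighted terms per book and invert them."""
    index = defaultdict(list)
    for doc_id, scores in tfidf_weights(transcripts).items():
        # Sort by descending weight; break ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for term, weight in ranked[:top_k]:
            if weight > 0:  # terms present in every transcript carry no signal
                index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    sample = {
        "book_a": "the whale hunts the sea",
        "book_b": "the garden grows the rose",
    }
    print(build_index(sample))
```

In this sketch a word that occurs in every transcript, such as "the" in the sample, receives a weight of zero and never becomes an index keyword, which is the filtering effect tf-idf is used for. On a real cluster each `reduce_sum` call would correspond to a separate MapReduce pass over the transcripts.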


            Published in

            WCI '15: Proceedings of the Third International Symposium on Women in Computing and Informatics
            August 2015
            763 pages
            ISBN: 9781450333610
            DOI: 10.1145/2791405

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            WCI '15 Paper Acceptance Rate: 98 of 452 submissions, 22%. Overall Acceptance Rate: 98 of 452 submissions, 22%.
