ABSTRACT
In recent years, content-based audio indexing has become a key research area, as audio content characterizes media more precisely while occupying comparatively little storage. In this paper, we present the conversion of audiobooks into textual information using the CMU Sphinx-4 speech transcriber, and the efficient indexing of audiobooks using term frequency-inverse document frequency (tf-idf) weights on the Apache Hadoop MapReduce framework. In the first phase, audiobook datasets are converted into text by training the CMU Sphinx-4 speech recognizer with acoustic models. In the next phase, the keywords present in the text file generated by the speech recognizer are filtered using tf-idf weights. Finally, we index the audio files based on the keywords extracted from the transcribed text. Because speech-to-text conversion and audio indexing are space- and time-intensive tasks, we ported these algorithms onto the Hadoop MapReduce framework. Porting content-based indexing of audiobooks onto the Hadoop distributed framework yielded considerable improvements in time and space utilization. As the amount of data being uploaded and downloaded keeps escalating, this work can be further extended to the indexing of images, video, and other multimedia forms.
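The keyword-filtering step described in the abstract can be illustrated with a minimal, single-machine sketch of tf-idf scoring over transcribed documents. The helper names `tf_idf` and `top_keywords` are ours, and the weighting shown is the standard tf-idf formulation; the paper's actual implementation distributes this computation across the Hadoop MapReduce framework.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for each term in each document.

    docs: list of token lists, one per transcribed audiobook.
    Returns a list of {term: weight} dicts, aligned with docs.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            # tf (relative frequency) times idf (log of inverse doc. freq.)
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

def top_keywords(weights, k=3):
    """Pick the k highest-weighted terms per document as index keywords."""
    return [sorted(w, key=w.get, reverse=True)[:k] for w in weights]
```

Terms that occur in every transcript (e.g. common stop words) receive an idf of zero and are filtered out automatically, which is what makes tf-idf suitable for selecting the distinctive keywords under which each audiobook is indexed.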
Index Terms
- Content Based Audiobooks Indexing using Apache Hadoop Framework