ABSTRACT
In recent years, content-based audio indexing has become a key research area, as audio content characterizes media more precisely while occupying comparatively little storage. In this paper, we present the conversion of audiobooks into textual information using the CMU Sphinx-4 speech transcriber, and the efficient indexing of audiobooks using term frequency-inverse document frequency (tf-idf) weights on the Apache Hadoop MapReduce framework. In the first phase, audiobook datasets are converted into text by training the CMU Sphinx-4 speech recognizer with acoustic models. In the next phase, the keywords present in the text file generated by the speech recognizer are filtered using tf-idf weights. Finally, we index the audio files based on the keywords extracted from the transcribed text. Because speech-to-text conversion and audio indexing are space- and time-intensive tasks, we ported these algorithms onto the Hadoop MapReduce framework. Porting content-based indexing of audiobooks onto the Hadoop distributed framework yielded considerable improvements in time and space utilization. As the amount of data being uploaded and downloaded keeps escalating, this work can be further extended to the indexing of images, video, and other multimedia forms.
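The keyword-filtering step described in the abstract can be illustrated with a minimal, single-machine sketch of tf-idf scoring over transcribed documents. The helper names `tf_idf` and `top_keywords` are ours, and the weighting shown is the standard tf-idf formulation; the paper's actual implementation distributes this computation across the Hadoop MapReduce framework.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for each term in each document.

    docs: list of token lists, one per transcribed audiobook.
    Returns a list of {term: weight} dicts, aligned with docs.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            # tf (relative frequency) times idf (log of inverse doc. freq.)
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

def top_keywords(weights, k=3):
    """Pick the k highest-weighted terms per document as index keywords."""
    return [sorted(w, key=w.get, reverse=True)[:k] for w in weights]
```

Terms that occur in every transcript (e.g. common stop words) receive an idf of zero and are filtered out automatically, which is what makes tf-idf suitable for selecting the distinctive keywords under which each audiobook is indexed.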
Index Terms
- Content Based Audiobooks Indexing using Apache Hadoop Framework