ABSTRACT
We propose a benchmark for evaluating the retrieval performance of soundtrack recommendation systems. Such systems aim to find songs suitable as background music for a given set of images. The benchmark is based on preference judgments: relevance is treated as a continuous ordinal variable, and judgments are collected for pairs of songs with respect to a query (i.e., a set of images). To capture a wide variety of songs and images, we draw on a large space of music genres, different emotions expressed through music, and various query-image themes. The benchmark consists of two types of relevance assessments: (i) judgments obtained from a user study, which serve as a "gold standard" for (ii) relevance judgments gathered through Amazon's Mechanical Turk. We report on the performance of two state-of-the-art soundtrack recommendation systems using the proposed benchmark.
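To make the idea of pairwise preference judgments concrete, the following minimal Python sketch scores a system's ranked list of songs against a set of judged pairs for one query. It is an illustration only, not the measure used in the benchmark: the function name `preference_agreement`, the `(preferred, other)` tuple layout, and the song identifiers are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical layout: one judgment is (preferred_song, other_song) for a single query.
Judgment = Tuple[str, str]


def preference_agreement(ranking: List[str], judgments: List[Judgment]) -> float:
    """Fraction of judged pairs that the ranking orders the same way as the assessors."""
    position = {song: rank for rank, song in enumerate(ranking)}
    unranked = len(ranking)  # songs missing from the ranking are treated as ranked last
    agree = 0
    counted = 0
    for preferred, other in judgments:
        p = position.get(preferred, unranked)
        o = position.get(other, unranked)
        if p == o:
            continue  # both songs unranked: the ranking expresses no order for this pair
        counted += 1
        if p < o:
            agree += 1
    return agree / counted if counted else 0.0


# Illustrative data (assumed, not from the benchmark).
ranking = ["song_a", "song_b", "song_c"]
judgments = [("song_a", "song_b"), ("song_c", "song_b"), ("song_a", "song_c")]
score = preference_agreement(ranking, judgments)
print(score)
```

In this toy example the ranking agrees with two of the three judged pairs. The same function could be applied to judgments from either source (user study or crowdsourcing) to compare how a system scores under each.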
Index Terms
- SRbench--a benchmark for soundtrack recommendation systems