Abstract
In this article, we report on our participation in the JRS Data-Mining Challenge. Our system takes a lazy-learning approach based on a simple k-nearest-neighbors technique. We treated the challenge as an opportunity to test techniques inspired by Information Retrieval (IR) in a data-mining setting. In particular, we tested several similarity measures, including one called vectorization, which we previously proposed and tested in IR and Natural Language Processing settings. The resulting system is simple and efficient while offering good performance.
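As an illustration of the lazy-learning idea, here is a minimal sketch of a k-nearest-neighbors classifier with a pluggable similarity measure. It is not the system described in the paper: the function names (`cosine`, `knn_predict`), the toy data, the majority-vote rule, and the choice of cosine similarity are all assumptions made for the example. The vectorization measure is not shown, since the abstract does not specify it.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define the similarity as 0 when either vector is all zeros.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, examples, k=3, similarity=cosine):
    """Label a query by majority vote among its k most similar examples.

    `examples` is a list of (vector, label) pairs. No model is built in
    advance: all work happens at query time, which is what makes the
    approach "lazy".
    """
    ranked = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three-dimensional vectors with one label each.
train = [
    ([1.0, 0.0, 0.0], "a"),
    ([0.9, 0.1, 0.0], "a"),
    ([0.0, 1.0, 0.0], "b"),
    ([0.0, 0.9, 0.1], "b"),
    ([0.1, 0.0, 1.0], "c"),
]
prediction = knn_predict([0.8, 0.2, 0.0], train, k=3)
print(prediction)
```

Because the similarity function is passed as a parameter, a different measure can be swapped in without touching the neighbor search, which is the kind of comparison the abstract describes.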
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Claveau, V. (2012). IRISA Participation in JRS 2012 Data-Mining Challenge: Lazy-Learning with Vectorization. In: Yao, J., et al. Rough Sets and Current Trends in Computing. RSCTC 2012. Lecture Notes in Computer Science, vol 7413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32115-3_53
DOI: https://doi.org/10.1007/978-3-642-32115-3_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32114-6
Online ISBN: 978-3-642-32115-3
eBook Packages: Computer Science, Computer Science (R0)