Abstract
Kernel Canonical Correlation Analysis (KCCA) is a method for finding correlated linear relationships between two multidimensional variables in a kernel-defined feature space. We study a machine learning algorithm based on KCCA for cross-language information retrieval and apply it to Japanese–English cross-language information retrieval. The results are encouraging and significantly better than those obtained by other state-of-the-art methods. Computational complexity is an important issue when applying KCCA to large datasets, as in information retrieval, and we experimentally evaluate several methods for alleviating this problem. We also investigate cross-language document classification using KCCA as well as other methods. Our results show that it is feasible to use a classifier learned in one language to classify documents in other languages.
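To make the idea concrete, the following is a minimal sketch of regularized KCCA and of how its projections could be used to rank documents across languages. It is not the implementation used in the article. It assumes one commonly used regularized eigenproblem, (Kx + κI)⁻¹ Ky (Ky + κI)⁻¹ Kx α = ρ² α, with already centered kernel matrices built from paired documents. The function names (`center_kernel`, `kcca`, `rank_documents`), the regularization value `kappa`, and the cosine-similarity ranking are illustrative assumptions.

```python
import numpy as np


def center_kernel(K):
    """Center a square kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularized KCCA on two centered n x n kernels of paired documents.

    Assumed formulation (a sketch, not the article's code):
      (Kx + kappa*I)^-1 Ky (Ky + kappa*I)^-1 Kx alpha = rho^2 alpha
      beta = (Ky + kappa*I)^-1 Kx alpha / rho
    """
    n = Kx.shape[0]
    I = np.eye(n)
    A = np.linalg.solve(Kx + kappa * I, Ky)  # (Kx + kappa*I)^-1 Ky
    B = np.linalg.solve(Ky + kappa * I, Kx)  # (Ky + kappa*I)^-1 Kx
    eigvals, eigvecs = np.linalg.eig(A @ B)
    # Eigenvalues are real in exact arithmetic; drop numerical imaginary parts.
    order = np.argsort(-eigvals.real)[:n_components]
    rho = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
    alpha = eigvecs[:, order].real
    beta = (B @ alpha) / np.maximum(rho, 1e-12)
    return alpha, beta, rho


def rank_documents(kx_query, alpha, Ky_docs, beta):
    """Rank target-language documents for a source-language query.

    kx_query: kernel values between the query and the source-language
              training documents (length n).
    Ky_docs:  kernel values between each candidate document and the
              target-language training documents (m x n).
    Returns document indices ordered from most to least similar (cosine).
    """
    q = kx_query @ alpha
    D = Ky_docs @ beta
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)
```

The dense solves and eigendecomposition here cost on the order of n³ operations for n training pairs, which illustrates why computational complexity becomes an issue on large collections.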
Cite this article
Li, Y., Shawe-Taylor, J. Using KCCA for Japanese–English cross-language information retrieval and document classification. J Intell Inf Syst 27, 117–133 (2006). https://doi.org/10.1007/s10844-006-1627-y
DOI: https://doi.org/10.1007/s10844-006-1627-y