Using KCCA for Japanese–English cross-language information retrieval and document classification

Li, Yaoyong; Shawe-Taylor, John

doi:10.1007/s10844-006-1627-y

Published: 07 September 2006

Journal of Intelligent Information Systems
Volume 27, pages 117–133 (2006)

Abstract

Kernel Canonical Correlation Analysis (KCCA) is a method for correlating linear relationships between two variables in a kernel-defined feature space. A machine learning algorithm based on KCCA is studied for cross-language information retrieval. We apply the algorithm to Japanese–English cross-language information retrieval. The results are quite encouraging and are significantly better than those obtained by other state-of-the-art methods. Computational complexity is an important issue when applying KCCA to large datasets, as in information retrieval. We experimentally evaluate several methods to alleviate the problem of applying KCCA to large datasets. We also investigate cross-language document classification using KCCA as well as other methods. Our results show that it is feasible to use a classifier learned in one language to classify documents in other languages.
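
As an illustration only, the following minimal sketch shows one common regularized formulation of KCCA, in which the dual coefficients for the first view are eigenvectors of (Kx + κI)⁻¹ Ky (Ky + κI)⁻¹ Kx. The abstract does not state which formulation, kernel, or regularization the paper uses, so the function names `center_kernel` and `kcca`, the parameter `kappa`, the linear kernel, and the synthetic two-view data are all assumptions made for this example.

```python
import numpy as np

def center_kernel(K):
    """Center a square kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularized KCCA on two centered kernel matrices (illustrative sketch).

    Solves (Kx + kappa*I)^-1 Ky (Ky + kappa*I)^-1 Kx alpha = rho^2 alpha
    and recovers beta from alpha. kappa is an assumed regularization value.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    vals, vecs = np.linalg.eig(M)
    # Keep the directions with the largest (real, non-negative) eigenvalues.
    order = np.argsort(-vals.real)[:n_components]
    rho = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs[:, order].real
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha) / np.maximum(rho, 1e-12)
    return alpha, beta, rho

# Synthetic stand-in for a parallel corpus: two noisy views of shared latent topics.
rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=(n, 2))
X = z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(n, 5))  # "language A" features
Y = z @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(n, 4))  # "language B" features

# Linear kernels are used here purely for simplicity.
Kx = center_kernel(X @ X.T)
Ky = center_kernel(Y @ Y.T)
alpha, beta, rho = kcca(Kx, Ky)

# Project both views into the shared space; paired documents should align.
Fx = Kx @ alpha
Fy = Ky @ beta
top_corr = np.corrcoef(Fx[:, 0], Fy[:, 0])[0, 1]
```

In a retrieval setting, a query in one language would be projected with its own kernel evaluations against the training documents and compared, for example by cosine similarity, with documents projected from the other language. Forming and factorizing the full n × n kernel matrices is the cost that becomes prohibitive on large collections, which is the complexity issue the abstract refers to.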

Author information

Authors and Affiliations

Department of Computer Science, The University of Sheffield, Sheffield, UK
Yaoyong Li
ISIS Group, School of Electronics and Computer Science, University of Southampton, Southampton, UK
John Shawe-Taylor

Corresponding author

Correspondence to Yaoyong Li.

About this article

Cite this article

Li, Y., Shawe-Taylor, J. Using KCCA for Japanese–English cross-language information retrieval and document classification. J Intell Inf Syst 27, 117–133 (2006). https://doi.org/10.1007/s10844-006-1627-y

Received: 22 March 2004
Revised: 19 January 2005
Accepted: 20 April 2006
Published: 07 September 2006
Issue Date: September 2006
DOI: https://doi.org/10.1007/s10844-006-1627-y
