Iterative Refining of Category Profiles for Nearest Centroid Cross-Domain Text Classification

Domeniconi, Giacomo; Moro, Gianluca; Pasolini, Roberto; Sartori, Claudio

doi:10.1007/978-3-319-25840-9_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 553)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

In cross-domain text classification, topic labels for documents of a target domain are predicted by leveraging knowledge from labeled documents of a source domain that covers the same or similar topics, possibly expressed with different words. Existing methods either adapt documents of the source domain to the target or represent both domains in a common space. These methods are mostly based on advanced statistical techniques and often require parameter tuning to achieve optimal performance. We propose a more straightforward approach based on nearest centroid classification: profiles of topic categories are extracted from the source domain and are then adapted through iterative refining steps using the most similar documents in the target domain. Experiments on common benchmark datasets show that this approach, despite its simplicity, achieves accuracy better than or comparable to that of other methods, even with fixed empirical values for its few parameters.
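
The refining loop described above can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the rule for selecting the "most similar" target documents, or the stopping criterion, so the choices below (cosine similarity, keeping the top fraction `top_frac` of documents per category, stopping when labels no longer change or after `max_iter` rounds) are assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def classify(docs, profiles):
    """Nearest centroid step: return a (label, similarity) pair per document."""
    out = []
    for d in docs:
        sims = [cosine(d, p) for p in profiles]
        best = max(range(len(profiles)), key=sims.__getitem__)
        out.append((best, sims[best]))
    return out

def refine_profiles(src_docs, src_labels, tgt_docs, n_classes,
                    max_iter=10, top_frac=0.5):
    """Assumed variant: rebuild each profile from its best-matching target docs.

    Every category is assumed to have at least one labeled source document.
    """
    # Initial profiles: centroids of the labeled source documents.
    profiles = [
        mean([normalize(d) for d, y in zip(src_docs, src_labels) if y == c])
        for c in range(n_classes)
    ]
    previous = None
    for _ in range(max_iter):
        assigned = classify(tgt_docs, profiles)
        labels = [lab for lab, _ in assigned]
        if labels == previous:  # assumed stopping rule: labels are stable
            break
        previous = labels
        for c in range(n_classes):
            members = sorted(
                (i for i, lab in enumerate(labels) if lab == c),
                key=lambda i: assigned[i][1],
                reverse=True,
            )
            if not members:
                continue  # no target document assigned: keep the old profile
            k = max(1, math.ceil(top_frac * len(members)))
            profiles[c] = mean([normalize(tgt_docs[i]) for i in members[:k]])
    return [lab for lab, _ in classify(tgt_docs, profiles)], profiles

# Toy term-count vectors over a 5-term vocabulary (made-up data).
# Term 0 occurs only in the source domain, term 4 only in the target domain.
src_docs = [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 1, 1, 0], [0, 0, 1, 1, 0]]
src_labels = [0, 0, 1, 1]
tgt_docs = [[0, 2, 0, 0, 1], [0, 1, 0, 0, 3], [0, 0, 2, 0, 0], [0, 0, 1, 1, 0]]

labels, profiles = refine_profiles(src_docs, src_labels, tgt_docs, n_classes=2)
print(labels)  # [0, 0, 1, 1]
```

In this toy run, the refined profile of category 0 drops the source-only term and gains weight on the target-only term, which is the kind of adaptation the abstract describes. In practice the vectors would come from a term-weighting scheme such as TF-IDF rather than raw counts.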

Notes

1. http://qwone.com/~jason/20Newsgroups/ (we used the bydate distribution)
2. http://people.cs.umass.edu/~mccallum/data/sraa.tar.gz
3. http://www.cse.ust.hk/TL/dataset/Reuters.zip

Author information

Authors and Affiliations

Università di Bologna, Bologna (BO), Italy
Giacomo Domeniconi, Gianluca Moro, Roberto Pasolini & Claudio Sartori

Corresponding author

Correspondence to Roberto Pasolini.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisboa, Portugal
Ana Fred
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Jan L. G. Dietz
University of Madeira, Funchal, Portugal
David Aveiro
Henley Business School, University of Reading, Reading, United Kingdom
Kecheng Liu
INSTICC, Setubal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Domeniconi, G., Moro, G., Pasolini, R., Sartori, C. (2015). Iterative Refining of Category Profiles for Nearest Centroid Cross-Domain Text Classification. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_4

DOI: https://doi.org/10.1007/978-3-319-25840-9_4
Published: 28 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25839-3
Online ISBN: 978-3-319-25840-9
eBook Packages: Computer Science, Computer Science (R0)
