Iterative Refining of Category Profiles for Nearest Centroid Cross-Domain Text Classification

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2014)

Abstract

In cross-domain text classification, topic labels for documents of a target domain are predicted by leveraging knowledge from labeled documents of a source domain, which covers the same or similar topics but possibly uses different words. Existing methods either adapt the source-domain documents to the target domain or represent both domains in a common space. These methods mostly rely on advanced statistical techniques and often require parameter tuning to reach optimal performance. We propose a more straightforward approach based on nearest centroid classification: profiles of topic categories are extracted from the source domain and then adapted through iterative refining steps that use the most similar documents in the target domain. Experiments on common benchmark datasets show that this approach, despite its simplicity, achieves accuracy better than or comparable to that of other methods, even with its few parameters fixed to empirical values.
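The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how documents are weighted, how "most similar" documents are selected, or when iteration stops. The function name `refine_profiles`, the `top_fraction` selection rule, the cosine similarity measure, the `max_iter` cap, and the stop-when-labels-are-stable criterion are all assumptions made for illustration.

```python
import numpy as np


def _unit_rows(m):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)


def refine_profiles(source_x, source_y, target_x, n_classes,
                    top_fraction=0.5, max_iter=10):
    """Nearest centroid classification with iteratively refined profiles.

    source_x, target_x: document-term matrices (rows are documents).
    source_y: integer category labels in range(n_classes) for the source rows;
              every category is assumed to have at least one source document.
    top_fraction, max_iter: hypothetical parameters, not taken from the paper.
    """
    source_x = _unit_rows(np.asarray(source_x, dtype=float))
    target_x = _unit_rows(np.asarray(target_x, dtype=float))
    source_y = np.asarray(source_y)

    # Initial category profiles: centroid of the source documents per category.
    profiles = np.vstack([source_x[source_y == c].mean(axis=0)
                          for c in range(n_classes)])

    previous = None
    for _ in range(max_iter):
        # Cosine similarity between every target document and every profile.
        sims = target_x @ _unit_rows(profiles).T
        labels = sims.argmax(axis=1)
        if previous is not None and np.array_equal(labels, previous):
            break  # assumed stopping rule: assignments no longer change
        previous = labels

        best = sims.max(axis=1)
        for c in range(n_classes):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue  # no target document assigned: keep the old profile
            keep = max(1, int(np.ceil(top_fraction * idx.size)))
            # Rebuild the profile from the most similar target documents only.
            top = idx[np.argsort(best[idx])[::-1][:keep]]
            profiles[c] = target_x[top].mean(axis=0)

    # Final nearest centroid assignment with the refined profiles.
    labels = (target_x @ _unit_rows(profiles).T).argmax(axis=1)
    return labels, profiles
```

In this sketch, a word that occurs only in the target domain can enter a category profile once a target document sharing some source vocabulary has been assigned to that category, which is the intuition behind adapting the profiles to the target domain.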

Notes

  1. http://qwone.com/~jason/20Newsgroups/ (we used the bydate distribution).

  2. http://people.cs.umass.edu/~mccallum/data/sraa.tar.gz

  3. http://www.cse.ust.hk/TL/dataset/Reuters.zip

Corresponding author

Correspondence to Roberto Pasolini.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Domeniconi, G., Moro, G., Pasolini, R., Sartori, C. (2015). Iterative Refining of Category Profiles for Nearest Centroid Cross-Domain Text Classification. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_4

  • DOI: https://doi.org/10.1007/978-3-319-25840-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25839-3

  • Online ISBN: 978-3-319-25840-9

  • eBook Packages: Computer Science (R0)
