
Generating Cross-Domain Text Classification Corpora from Social Media Comments

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11696)

Abstract

In natural language processing (NLP), cross-domain text classification problems such as cross-topic, cross-genre, or cross-language authorship attribution are characterized by training and testing data that come from different contexts. That is, learning algorithms trained on the specific properties of the training data have to make predictions on test data with substantially different properties. To date, the corpora used for analyses of cross-domain problems are limited in size and variation, which reduces the expressive power and generalizability of the proposed solutions. In this paper, we present a methodological framework and toolset for dynamically creating cross-domain datasets from millions of Reddit comments. We show that different types of cross-domain datasets, such as cross-topic or cross-lingual corpora, can be constructed, and we demonstrate a wide variety of use cases, including previously infeasible analyses like cross-lingual authorship attribution on original, non-translated texts. Using state-of-the-art authorship attribution methods, we show the potential of a cross-topic corpus generated by our framework in comparison to the corpora used in related approaches, enabling advances in research previously limited by corpus availability.
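To make the idea of a cross-topic corpus concrete, the following is a minimal sketch and not the authors' toolset. It assumes the comments are available as newline-delimited JSON records with `author`, `subreddit`, and `body` fields, and it treats a subreddit as a stand-in for a topic. The function name `build_cross_topic_split` and the `min_comments` threshold are hypothetical, introduced only for illustration.

```python
import json
from collections import defaultdict


def build_cross_topic_split(lines, train_topic, test_topic, min_comments=2):
    """Return (train, test) lists of (author, text) pairs.

    Only authors with at least `min_comments` comments in BOTH topics are
    kept, so every test author is seen in training, but on another topic.
    Assumption: each line is a JSON object with "author", "subreddit",
    and "body" keys.
    """
    # author -> subreddit -> list of comment bodies
    by_author = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comment = json.loads(line)
        author = comment.get("author")
        body = comment.get("body")
        # Skip deleted accounts and removed comments.
        if author in (None, "[deleted]"):
            continue
        if body in (None, "[deleted]", "[removed]"):
            continue
        by_author[author][comment.get("subreddit")].append(body)

    train, test = [], []
    for author in sorted(by_author):
        topics = by_author[author]
        enough_train = len(topics.get(train_topic, [])) >= min_comments
        enough_test = len(topics.get(test_topic, [])) >= min_comments
        if enough_train and enough_test:
            train.extend((author, text) for text in topics[train_topic])
            test.extend((author, text) for text in topics[test_topic])
    return train, test
```

A cross-lingual variant would follow the same pattern, grouping each author's comments by detected language instead of by subreddit.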

Notes

  1. https://reddit.com/r/datasets/comments/64o7py/.

  2. https://files.pushshift.io/reddit/comments/.

  3. Taken from https://www.reddit.com/r/autowikibot/wiki/redditbots.

  4. https://github.com/bmurauer/reddit_corpora.

  5. https://pypi.org/project/langdetect/.

  6. https://github.com/bmurauer/reddit_corpora.


Author information

Correspondence to Benjamin Murauer or Günther Specht.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Murauer, B., Specht, G. (2019). Generating Cross-Domain Text Classification Corpora from Social Media Comments. In: Crestani, F., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_7


  • DOI: https://doi.org/10.1007/978-3-030-28577-7_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28576-0

  • Online ISBN: 978-3-030-28577-7

  • eBook Packages: Computer Science (R0)
