
A Challenging Dataset for Bias Detection: The Case of the Crisis in the Ukraine

  • Conference paper
  • First Online:
Social, Cultural, and Behavioral Modeling (SBP-BRiMS 2019)

Abstract

The use of disinformation and purposefully biased reportage to sway public opinion has become a serious concern. We present a new dataset related to the Ukrainian Crisis of 2014–2015, which other researchers can use to train, test, and compare bias detection algorithms. The dataset comprises 4,538 English-language articles related to the crisis from 227 news sources in 43 countries (including the Ukraine), totaling 1.7M words. We manually classified the bias of each article as either pro-Russian, pro-Western, or Neutral, and also aligned each article with a master timeline of 17 major events. When trained on the whole dataset, a simple baseline SVM classifier using doc2vec embeddings as features achieves an \(F_{1}\) score of 0.86. This performance is deceptively high, however, because (1) the model is almost completely unable to correctly classify articles published in the Ukraine (0.07 \(F_{1}\)), and (2) the model performs nearly as well when trained on unrelated geopolitics articles written by the same publishers and tested on the dataset. As other researchers have pointed out, these results suggest that models of this type are learning journalistic styles rather than actually modeling bias. This implies that more sophisticated approaches will be necessary for true bias detection and classification, and this dataset can serve as an incisive test of new approaches.
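The baseline described in the abstract (doc2vec document embeddings fed to an SVM classifier) can be sketched as follows. This is a minimal illustration, not the authors' code: TF-IDF vectors stand in for doc2vec embeddings so the sketch depends only on scikit-learn, and the texts and labels are toy stand-ins for the actual dataset.

```python
# Minimal sketch of an SVM bias-classification baseline.
# NOTE: the paper uses doc2vec embeddings as features; TF-IDF vectors stand
# in here so the example needs only scikit-learn. Texts are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy articles, each given one of the paper's three bias labels.
texts = [
    "moscow acted defensively against western provocation in donbass",
    "the referendum reflected the genuine will of crimean residents",
    "nato encirclement forced russia to protect its compatriots",
    "the kremlin's invasion violated ukrainian sovereignty and law",
    "sanctions punish russian aggression and support kyiv's government",
    "separatists armed by russia shot down the civilian airliner",
    "officials from both sides met in minsk to discuss a ceasefire",
    "the monitoring mission reported shelling near the contact line",
    "talks on gas transit contracts continued without an agreement",
]
labels = ["pro-Russian"] * 3 + ["pro-Western"] * 3 + ["Neutral"] * 3

# In the real pipeline a trained doc2vec model would produce the document
# vectors; TfidfVectorizer is the stand-in feature extractor here.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

# Macro-averaged F1 on the training set (the paper reports held-out F1).
pred = clf.predict(texts)
macro_f1 = f1_score(labels, pred, average="macro")
print(f"training macro-F1: {macro_f1:.2f}")
```

As the abstract cautions, a high aggregate \(F_{1}\) from such a model can reflect publisher writing style rather than bias, so any real evaluation should also test across unseen sources (e.g., holding out all Ukrainian outlets).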


Notes

  1. As classified by the news media site AllSides, https://www.allsides.com/unbiased-balanced-news.

  2. https://www.vox.com/2019/2/14/18222167/trump-border-security-deal.

  3. http://www.fox5dc.com/news/border-wall-national-emergency-government-funding-trump.

  4. https://sputniknews.com/.

  5. The second author is an undergraduate researcher majoring in International Relations and specializing in Russia.

  6. Sputnik uses the word “Topics” to refer to their article categories, though these serve the same organizing purpose as Wikipedia’s events.


Acknowledgements

This work was supported by Office of Naval Research (ONR) grant number N00014-17-1-2983.

Author information

Corresponding author

Correspondence to Mark A. Finlayson.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Cremisini, A., Aguilar, D., Finlayson, M.A. (2019). A Challenging Dataset for Bias Detection: The Case of the Crisis in the Ukraine. In: Thomson, R., Bisgin, H., Dancy, C., Hyder, A. (eds) Social, Cultural, and Behavioral Modeling. SBP-BRiMS 2019. Lecture Notes in Computer Science, vol 11549. Springer, Cham. https://doi.org/10.1007/978-3-030-21741-9_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-21741-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21740-2

  • Online ISBN: 978-3-030-21741-9

  • eBook Packages: Computer Science (R0)
