News Gathering: Leveraging Transformers to Rank News

  • Conference paper
Advances in Information Retrieval (ECIR 2024)

Abstract

News media outlets disseminate information across multiple platforms. Often, these posts present complementary content and perspectives on the same news story. However, to compile a set of related news articles, users must scour multiple sources and platforms, manually identifying which publications pertain to the same story. This tedious process slows essential journalistic tasks, notably fact-checking. To tackle this problem, we created a dataset of related and unrelated news pairs, which allows us to develop information retrieval models grounded in the principle of binary relevance. Recognizing that many Transformer-based models might suit this task but could overemphasize relationships based on lexical overlap, we tailored the dataset so that fine-tuning steers these models toward semantically relevant connections in the news domain. To craft this dataset, we introduce a methodology for identifying pairs of news stories that are lexically similar yet refer to different events, as well as pairs that discuss the same event but have distinct lexical structures. This design compels Transformers to recognize semantic connections between stories even when lexical similarity is absent. A human-annotated evaluation shows that BERT outperformed the other techniques, excelling even on challenging test cases. To ensure reproducibility, we have made the dataset and the top-performing models publicly available.
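The abstract outlines two concrete steps: mining "hard" training pairs (lexically similar articles about different events, and articles about the same event with little lexical overlap), and scoring candidate pairs with a fine-tuned Transformer under binary relevance. Below is a minimal sketch of both steps, assuming a TF-IDF cosine proxy for lexical similarity and a generic bert-base-uncased checkpoint standing in for the fine-tuned cross-encoder; the helper names, thresholds, and model choice are illustrative, not the authors' released artifacts.

    # Hypothetical sketch; not the authors' released code.
    import torch
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    def mine_hard_pairs(texts, event_ids, lo=0.2, hi=0.6):
        """Contrast lexical overlap with event labels to find hard pairs.

        Hard negatives: lexically similar (TF-IDF cosine >= hi) but about
        different events. Hard positives: same event but lexically distinct
        (TF-IDF cosine <= lo). The thresholds are assumptions.
        """
        sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))
        pairs = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                same_event = event_ids[i] == event_ids[j]
                if same_event and sim[i, j] <= lo:
                    pairs.append((i, j, 1))   # hard positive
                elif not same_event and sim[i, j] >= hi:
                    pairs.append((i, j, 0))   # hard negative
        return pairs

    # "bert-base-uncased" stands in for the fine-tuned checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    def relevance_score(article_a, article_b):
        """Probability that two articles cover the same news story."""
        inputs = tokenizer(article_a, article_b, truncation=True,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

Candidate articles scored this way can be ranked by relevance_score to assemble the set of publications covering a single story.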

Supported by the Millennium Institute for Foundational Research on Data (ANID ICM grant ICN17-002) and the ANID PIA National Center of Artificial Intelligence, grant FB210017. M. Mendoza acknowledges funding from ANID Fondecyt grant 1200211.


Notes

  1. https://www.clickworker.com/.

  2. https://huggingface.co/EleutherAI/gpt-j-6b.

  3. https://huggingface.co/tiiuae/falcon-7b.

  4. https://huggingface.co/tloen/alpaca-lora-7b.
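Footnotes 2–4 point to the generative language-model checkpoints evaluated alongside BERT. This page does not show how those baselines were queried; a minimal sketch of loading one footnoted checkpoint with the Hugging Face transformers library and asking a yes/no same-story question, where the prompt wording and placeholder articles are assumptions:

    # Illustrative only; the prompt format is an assumption, not the paper's.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "EleutherAI/gpt-j-6b"  # footnote 2; tiiuae/falcon-7b loads the same way
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto")

    prompt = ("Article A: <first article>\n"
              "Article B: <second article>\n"
              "Do these two articles report on the same news story? "
              "Answer yes or no:")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))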


Author information

Correspondence to Marcelo Mendoza.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Muñoz, C., Apolo, M.J., Ojeda, M., Lobel, H., Mendoza, M. (2024). News Gathering: Leveraging Transformers to Rank News. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_41


  • DOI: https://doi.org/10.1007/978-3-031-56063-7_41


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
