Matching news articles and wikipedia tables for news augmentation

  • Regular Paper
Knowledge and Information Systems

Abstract

Readers of digital news are often overwhelmed by the deluge of online information. One way to address this problem is to outline a news story by highlighting its most relevant facts; for example, recent studies summarize news articles by generating representative headlines. In this paper, we go a step further and argue that news understanding can also be enhanced by surfacing contextual data relevant to the article, such as structured web tables. Specifically, our goal is to match news articles with web tables for news augmentation. To that end, we introduce a novel BERT-based attention model that computes the degree of matching between an article and a table. Through an extensive experimental evaluation over Wikipedia tables, we compare our model with standard IR techniques, document/sentence encoders, and neural IR models on this task. Overall, our model outperforms all baselines at different levels of accuracy and in mean reciprocal rank.
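
The two kinds of evaluation measure named in the abstract can be illustrated with a minimal sketch. It assumes that each news article has exactly one correct table and that a system returns a ranked list of table identifiers per article; the function names and the toy identifiers (`t1`, `t2`, `t3`) are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant table, or 0.0 if it was not retrieved."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         relevant: Sequence[str]) -> float:
    """Average the reciprocal rank over all articles (queries)."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, relevant)]
    return sum(scores) / len(scores)


def accuracy_at_k(rankings: Sequence[Sequence[str]],
                  relevant: Sequence[str], k: int) -> float:
    """Fraction of articles whose correct table appears in the top k results."""
    hits = [g in list(r)[:k] for r, g in zip(rankings, relevant)]
    return sum(hits) / len(hits)


# Toy example (assumed data): three articles, each with correct table "t1".
rankings = [["t1", "t2", "t3"], ["t2", "t1", "t3"], ["t3", "t2", "t1"]]
relevant = ["t1", "t1", "t1"]

mrr = mean_reciprocal_rank(rankings, relevant)   # (1 + 1/2 + 1/3) / 3
acc1 = accuracy_at_k(rankings, relevant, k=1)    # 1 of 3 articles
acc2 = accuracy_at_k(rankings, relevant, k=2)    # 2 of 3 articles
```

Accuracy at a cutoff only records whether the correct table appears within the top k results, whereas mean reciprocal rank also rewards placing it closer to the top, which is presumably why both are reported.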

Availability of data and materials

The datasets and source code are publicly available at https://github.com/levysouza/News-Table-Matching.

Notes

  1. The Web has over 14.1B tables (Cafarella et al. [3]).

  2. https://news.google.com.

  3. https://news.microsoft.com.

  4. https://artincontext.org/most-expensive-paintings.

  5. We demonstrate its performance in our ablation study (see results in Table 8).

  6. https://github.com/levysouza/News-Table-Matching.

  7. The BERT architecture has over 340 million parameters.

  8. We also tried Long Short-Term Memory (LSTM) networks for this step, but GRUs achieved better results.

  9. Our ablation study shows that combining these two attention mechanisms improves the model's performance (see results in Table 8).

  10. https://tinyurl.com/ranking-loss.

  11. We tried the following similarity thresholds for the cosine distance over positive and negative pairs: 0.3, 0.4, 0.5, 0.6, and 0.7. Of these, 0.3 achieved the best results on our validation dataset.

  12. https://newspaper.readthedocs.io/en/latest/.

  13. https://www.elastic.co.

  14. https://fasttext.cc/.

  15. https://scikit-learn.org.

  16. https://pypi.org/project/rank-bm25.

  17. https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html.

  18. https://tfhub.dev/google/universal-sentence-encoder/4.

  19. https://github.com/hanxiao/bert-as-service.

  20. https://huggingface.co/transformers/model_doc/bert.html.

  21. https://github.com/NTMC-Community/MatchZoo.

  22. https://www.sbert.net/docs/pretrained-models/dpr.html.

  23. We do not combine the table content with its surrounding text, since the table aspects achieve the worst results for the evaluated metric.

  24. The training times of the cross-encoder and bi-encoder BERT-based models are similar, as both are built on the BERT architecture. In our experiments, fine-tuning takes over 20 min per epoch.
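
Notes 10 and 11 refer to a ranking loss with a cosine threshold applied to positive and negative pairs. One common formulation of such a loss is the cosine embedding loss with a margin, sketched below in plain Python. Whether this is the exact loss the authors use is an assumption; the function names are hypothetical, and only the default margin of 0.3 comes from Note 11.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_margin_loss(u: Sequence[float], v: Sequence[float],
                       is_match: bool, margin: float = 0.3) -> float:
    """Cosine embedding loss for one (article, table) embedding pair.

    Assumed formulation, not confirmed by the source:
    - matching pair:     loss = 1 - cos(u, v)
    - non-matching pair: loss = max(0, cos(u, v) - margin)
    """
    sim = cosine_similarity(u, v)
    if is_match:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Toy embeddings (assumed data).
article = [1.0, 0.0]
same_direction = [1.0, 0.0]
orthogonal = [0.0, 1.0]

loss_pos = cosine_margin_loss(article, same_direction, is_match=True)
loss_neg_far = cosine_margin_loss(article, orthogonal, is_match=False)
loss_neg_close = cosine_margin_loss(article, same_direction, is_match=False)
```

Under this formulation, a negative pair contributes to the loss only while its similarity exceeds the margin, so a lower margin such as 0.3 pushes non-matching articles and tables further apart than a margin of 0.7 would.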

References

  1. Agarwal S, Singh NK, Meel P (2018) Single-document summarization using sentence embeddings and k-means clustering. In: Proceedings of the 2018 international conference on advances in computing, communication control and networking, IEEE, pp 162–165

  2. Bhagavatula CS, Noraset T, Downey D (2013) Methods for exploring and mining tables on wikipedia. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics. ACM, pp 18–26

  3. Cafarella MJ, Halevy AY, Wang DZ et al (2008) Webtables: exploring the power of tables on the web. Proc VLDB Endow 1(1):538–549

  4. Cafarella MJ, Halevy AY, Lee H et al (2018) Ten years of webtables. Proc VLDB Endow 11(12):2140–2149

  5. Cer D, Yang Y, Kong SY, et al (2018) Universal sentence encoder. CoRR abs/1803.11175

  6. Chakrabarti K, Chen Z, Shakeri S, et al (2020) Tableqna: Answering list intent queries with web tables. CoRR abs/2001.04828

  7. Chen X, Cheng Y, Wang S, et al (2021) Earlybert: Efficient BERT training via early-bird lottery tickets. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing. Association for Computational Linguistics, Virtual Event, pp 2195–2207

  8. Chen Z, Trabelsi M, Heflin J, et al (2020) Table search using a deep contextualized language model. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 589–598

  9. Chung J, Gülçehre Ç, Cho K, et al (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555

  10. Dai Z, Xiong C, Callan J, et al (2018) Convolutional neural networks for soft-matching n-grams in ad-hoc search. In: Proceedings of the 11th ACM international conference on web search and data mining. ACM, Marina Del Rey, USA, pp 126–134

  11. Devlin J, Chang MW, Lee K, et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186

  12. Gavrilov D, Kalaidin P, Malykh V (2019) Self-attentive model for headline generation. In: Proceedings of the 41st European conference on information retrieval research, Lecture Notes in Computer Science, vol 11438. Springer, Cologne, Germany, pp 87–93

  13. Glass M, Canim M, Gliozzo A, et al (2021) Capturing row and column semantics in transformer based question answering over tables. CoRR abs/2104.08303

  14. Govindaraju V, Zhang C, Ré C (2013) Understanding tables in context using standard NLP toolkits. In: Proceedings of the 51st annual meeting of the association for computational linguistics. The Association for Computer Linguistics, Sofia, Bulgaria, pp 658–664

  15. Gu X, Mao Y, Han J, et al (2020) Generating representative headlines for news stories. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 1773–1784

  16. Guo J, Fan Y, Ai Q, et al (2016) A deep relevance matching model for ad-hoc retrieval. In: Proceedings of the 25th ACM international conference on information and knowledge management. ACM, Indianapolis, USA, pp 55–64

  17. Guo J, Fan Y, Ji X, et al (2019) Matchzoo: A learning, practicing, and developing system for neural text matching. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. ACM, Paris, France, pp 1297–1300

  18. Hu B, Lu Z, Li H, et al (2015) Convolutional neural network architectures for matching natural language sentences. CoRR abs/1503.03244

  19. Huang P, He X, Gao J, et al (2013) Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM international conference on information and knowledge management. ACM, San Francisco, USA, pp 2333–2338

  20. Karpukhin V, Oguz B, Min S, et al (2020) Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 conference on empirical methods in natural language processing. Association for Computational Linguistics, Virtual Event, pp 6769–6781

  21. Kim DH, Hoque E, Kim J, et al (2018) Facilitating document reading by linking text and tables. In: Proceedings of the 31st annual ACM symposium on user interface software and technology. ACM, Berlin, Germany, pp 423–434

  22. Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning, Beijing, China, pp 1188–1196

  23. Lees AW, Yu C, Korn F, et al (2021) Collocating structured web tables with news articles. In: Proceedings of the 1st international workshop on news recommendation and intelligence co-located with the web conference 2021

  24. Li J, Dou Z, Zhu Y et al (2020) Deep cross-platform product matching in e-commerce. Inf Ret J 23(2):136–158

  25. Liu Y, Bai K, Mitra P, et al (2007a) Tablerank: A ranking algorithm for table search and retrieval. In: Proceedings of the 22nd AAAI conference on artificial intelligence, Vancouver, Canada, pp 317–322

  26. Liu Y, Bai K, Mitra P, et al (2007b) Tableseer: automatic table metadata extraction and searching in digital libraries. In: Proceedings of the 7th ACM/IEEE joint conference on digital libraries. ACM, Vancouver, Canada, pp 91–100

  27. Maity SK, Panigrahi A, Ghosh S, et al (2019) Deeptagrec: A content-cum-user based tag recommendation framework for stack overflow. In: Proceedings of the 41st European conference on information retrieval, Springer, Cologne, Germany, pp 125–131

  28. Mitra B, Diaz F, Craswell N (2017) Learning to match using local and distributed representations of text for web search. In: Proceedings of the 26th international conference on world wide web. ACM, Perth, Australia, pp 1291–1299

  29. Nallapati R, Zhou B, dos Santos CN, et al (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of the 20th conference on computational natural language learning. ACL, Berlin, Germany, pp 280–290

  30. Nogueira R, Cho K (2019) Passage re-ranking with BERT. CoRR abs/1901.04085

  31. Pang L, Lan Y, Guo J, et al (2016) Text matching as image recognition. In: Proceedings of the 30th AAAI conference on artificial intelligence, Phoenix, USA, pp 2793–2799

  32. Pimplikar R, Sarawagi S (2012) Answering table queries on the web using column keywords. Proc VLDB Endow 5(10):908–919

  33. Pyreddy P, Croft WB (1997) TINTIN: A system for retrieval in text tables. In: Proceedings of the 2nd ACM international conference on digital libraries. ACM, Philadelphia, USA, pp 193–200

  34. Reimers N, Gurevych I (2019) Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing. Association for Computational Linguistics, Hong Kong, China, pp 3980–3990

  35. Robertson SE, Zaragoza H (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr 3(4):333–389

  36. Rush AM, Chopra S, Weston J (2015) A neural attention model for abstractive sentence summarization. In: Proceedings of the 2015 conference on empirical methods in natural language processing. The Association for Computational Linguistics, Lisbon, Portugal, pp 379–389

  37. Salton G, Yang CS (1973) On the specification of term values in automatic indexing. Journal of Documentation

  38. Sanh V, Debut L, Chaumond J, et al (2019) Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. http://arxiv.org/abs/1910.01108

  39. dos Santos CN, Barbosa L, Bogdanova D, et al (2015) Learning hybrid representations to retrieve semantically equivalent questions. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing. The Association for Computer Linguistics, Beijing, China, pp 694–699

  40. Santosh TYSS, Saha A, Ganguly N (2020) MVL: multi-view learning for news recommendation. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 1873–1876

  41. Shraga R, Roitman H, Feigenblat G, et al (2020a) Ad hoc table retrieval using intrinsic and extrinsic similarities. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 2479–2485

  42. Shraga R, Roitman H, Feigenblat G, et al (2020b) Web table retrieval using multimodal deep learning. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 1399–1408

  43. Shraga R, Roitman H, Feigenblat G, et al (2020c) Projection-based relevance model for table retrieval. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 28–29

  44. Sun C, Qiu X, Xu Y, et al (2019a) How to fine-tune BERT for text classification? In: Proceedings of the 19th China national conference on Chinese computational linguistics, Springer, Hainan, China, pp 194–206

  45. Sun H, Ma H, He X, et al (2016) Table cell search for question answering. In: Proceedings of the 25th international conference on world wide web. ACM, Montreal, Canada, pp 771–782

  46. Sun Y, Yan Z, Tang D et al (2019) Content-based table retrieval for web queries. Neurocomputing 349:183–189

  47. Thakur N, Reimers N, Daxenberger J, et al (2021) Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Virtual Event, pp 296–310

  48. Trabelsi M, Davison BD, Heflin J (2019) Improved table retrieval using multiple context embeddings for attributes. In: Proceedings of the 2019 IEEE international conference on big data. IEEE, Los Angeles, USA, pp 1238–1244

  49. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Proceedings of the 30th international conference on neural information processing systems, Long Beach, USA, pp 5998–6008

  50. Venetis P, Halevy AY, Madhavan J et al (2011) Recovering semantics of tables on the web. Proc VLDB Endow 4(9):528–538

  51. Wan S, Lan Y, Guo J, et al (2016) A deep architecture for semantic matching with multiple positional sentence representations. In: Proceedings of the 30th AAAI conference on artificial intelligence, Phoenix, USA, pp 2835–2841

  52. Xiong C, Dai Z, Callan J, et al (2017a) End-to-end neural ad-hoc ranking with kernel pooling. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval. ACM, Shinjuku, Japan, pp 55–64

  53. Xiong C, Zhong V, Socher R (2017b) Dynamic coattention networks for question answering. In: Proceedings of the 5th international conference on learning representations, Toulon, France

  54. Zhang L, Zhang S, Balog K (2019) Table2vec: Neural word and entity embeddings for table population and retrieval. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. ACM, Paris, France, pp 1029–1032

  55. Zhang R, Guo J, Fan Y, et al (2018) Question headline generation for news articles. In: Proceedings of the 27th ACM international conference on information and knowledge management. ACM, Torino, Italy, pp 617–626

  56. Zhang S, Balog K (2018) Ad hoc table retrieval using semantic similarity. In: Proceedings of the web conference 2018. ACM, Lyon, France, pp 1553–1562

  57. Zhang S, Balog K (2020) Web table extraction, retrieval, and augmentation: a survey. ACM Trans Intell Syst Technol 11(2):1–35

  58. Zhu M, Ahuja A, Wei W, et al (2019) A hierarchical attention retrieval model for healthcare question answering. In: Proceedings of the web conference 2019. ACM, San Francisco, USA, pp 2472–2482

Acknowledgements

Not Applicable.

Funding

This work is supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) from Brazil.

Author information

Contributions

These authors contributed equally to this work.

Corresponding author

Correspondence to Levy Silva.

Ethics declarations

Conflict of interest

The authors declare they have no conflict of interest.

Ethical Approval and Consent to participate

Not Applicable.

Consent for publication

The authors agree to the publication of this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Silva, L., Barbosa, L. Matching news articles and wikipedia tables for news augmentation. Knowl Inf Syst 65, 1713–1734 (2023). https://doi.org/10.1007/s10115-022-01815-0

  • DOI: https://doi.org/10.1007/s10115-022-01815-0
