Abstract
Nowadays, understanding digital news is often hampered by the deluge of online information. One approach to bridging this gap is to outline the news story by highlighting its most relevant facts. For example, recent studies summarize news articles by generating representative headlines. In this paper, we go further and argue that news understanding can also be enhanced by surfacing contextual data relevant to the article, such as structured web tables. Specifically, our goal is to match news articles and web tables for news augmentation. To that end, we introduce a novel BERT-based attention model to compute this matching degree. Through an extensive experimental evaluation over Wikipedia tables, we compare the performance of our model with standard IR techniques, document/sentence encoders, and neural IR models on this task. Overall, the results show that our model outperforms all baselines at different levels of accuracy and in the mean reciprocal rank measure.


Availability of data and materials
The datasets and source code are publicly available at: https://github.com/levysouza/News-Table-Matching.
Notes
The Web has over 14.1B tables (Cafarella et al. [3]).
We demonstrate its performance in our ablation study (see results in Table 8).
The BERT architecture has over 340 million parameters.
We also tried Long Short-Term Memory (LSTM) networks for this step, but GRUs achieved better results.
Our ablation study shows that we can improve the model performance by combining these two attention methods (see results in Table 8).
We tried the following similarity thresholds for the cosine distance over positive and negative pairs: 0.3, 0.4, 0.5, 0.6, and 0.7, of which 0.3 achieved the best results on our validation dataset.
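The threshold sweep described in this note can be sketched as follows. This is a minimal illustration, not the authors' code: the note does not say how the threshold enters the model, so the sketch assumes it acts as a decision boundary on cosine similarity and that validation accuracy is the selection criterion. The function names and the toy validation pairs are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def accuracy_at_threshold(pairs, threshold):
    """Fraction of (u, v, label) pairs classified correctly.

    Assumption: a pair is predicted positive (label 1) when its cosine
    similarity is at least `threshold`, and negative (label 0) otherwise.
    """
    correct = 0
    for u, v, label in pairs:
        predicted = 1 if cosine_similarity(u, v) >= threshold else 0
        correct += int(predicted == label)
    return correct / len(pairs)


def select_threshold(pairs, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the candidate with the highest validation accuracy.

    Ties are resolved in favour of the earliest candidate in `candidates`.
    """
    return max(candidates, key=lambda t: accuracy_at_threshold(pairs, t))


# Hypothetical validation pairs: (article vector, table vector, label).
validation_pairs = [
    ([1.0, 0.0], [1.0, 0.1], 1),
    ([1.0, 0.0], [0.5, 1.0], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([1.0, 0.0], [-1.0, 0.2], 0),
]

best_threshold = select_threshold(validation_pairs)
```

In practice the vectors would be the embeddings produced by the encoder under evaluation, and the selection metric could be any validation measure rather than plain accuracy.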
We do not combine the table content with its surrounding text, since the table aspects achieve the worst results for the evaluated metric.
The training times for cross-encoder and bi-encoder BERT-based models are similar, as both are built on the BERT architecture. In our experiments, their fine-tuning time is over 20 min per epoch.
References
Agarwal S, Singh NK, Meel P (2018) Single-document summarization using sentence embeddings and k-means clustering. In: Proceedings of the 2018 international conference on advances in computing, communication control and networking, IEEE, pp 162–165
Bhagavatula CS, Noraset T, Downey D (2013) Methods for exploring and mining tables on wikipedia. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics. ACM, pp 18–26
Cafarella MJ, Halevy AY, Wang DZ et al (2008) Webtables: exploring the power of tables on the web. Proc VLDB Endow 1(1):538–549
Cafarella MJ, Halevy AY, Lee H et al (2018) Ten years of webtables. Proc VLDB Endow 11(12):2140–2149
Cer D, Yang Y, Kong SY, et al (2018) Universal sentence encoder. CoRR abs/1803.11175
Chakrabarti K, Chen Z, Shakeri S, et al (2020) Tableqna: Answering list intent queries with web tables. CoRR abs/2001.04828
Chen X, Cheng Y, Wang S, et al (2021) Earlybert: Efficient BERT training via early-bird lottery tickets. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing. Association for Computational Linguistics, Virtual Event, pp 2195–2207
Chen Z, Trabelsi M, Heflin J, et al (2020) Table search using a deep contextualized language model. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 589–598
Chung J, Gülçehre Ç, Cho K, et al (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555
Dai Z, Xiong C, Callan J, et al (2018) Convolutional neural networks for soft-matching n-grams in ad-hoc search. In: Proceedings of the 11th ACM international conference on web search and data mining. ACM, Marina Del Rey, USA, pp 126–134
Devlin J, Chang MW, Lee K, et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186
Gavrilov D, Kalaidin P, Malykh V (2019) Self-attentive model for headline generation. In: Proceedings of the 41st European conference on information retrieval research, Lecture Notes in Computer Science, vol 11438. Springer, Cologne, Germany, pp 87–93
Glass M, Canim M, Gliozzo A, et al (2021) Capturing row and column semantics in transformer based question answering over tables. CoRR abs/2104.08303
Govindaraju V, Zhang C, Ré C (2013) Understanding tables in context using standard NLP toolkits. In: Proceedings of the 51st annual meeting of the association for computational linguistics. The Association for Computer Linguistics, Sofia, Bulgaria, pp 658–664
Gu X, Mao Y, Han J, et al (2020) Generating representative headlines for news stories. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 1773–1784
Guo J, Fan Y, Ai Q, et al (2016) A deep relevance matching model for ad-hoc retrieval. In: Proceedings of the 25th ACM international conference on information and knowledge management. ACM, Indianapolis, USA, pp 55–64
Guo J, Fan Y, Ji X, et al (2019) Matchzoo: A learning, practicing, and developing system for neural text matching. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. ACM, Paris, France, pp 1297–1300
Hu B, Lu Z, Li H, et al (2015) Convolutional neural network architectures for matching natural language sentences. CoRR abs/1503.03244
Huang P, He X, Gao J, et al (2013) Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM international conference on information and knowledge management. ACM, San Francisco, USA, pp 2333–2338
Karpukhin V, Oguz B, Min S, et al (2020) Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 conference on empirical methods in natural language processing. Association for Computational Linguistics, Virtual Event, pp 6769–6781
Kim DH, Hoque E, Kim J, et al (2018) Facilitating document reading by linking text and tables. In: Proceedings of the 31st annual ACM symposium on user interface software and technology. ACM, Berlin, Germany, pp 423–434
Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning, Beijing, China, pp 1188–1196
Lees AW, Yu C, Korn F, et al (2021) Collocating structured web tables with news articles. In: Proceedings of the 1st international workshop on news recommendation and intelligence co-located with the web conference 2021
Li J, Dou Z, Zhu Y et al (2020) Deep cross-platform product matching in e-commerce. Inf Ret J 23(2):136–158
Liu Y, Bai K, Mitra P, et al (2007a) Tablerank: A ranking algorithm for table search and retrieval. In: Proceedings of the 22nd AAAI conference on artificial intelligence, Vancouver, Canada, pp 317–322
Liu Y, Bai K, Mitra P, et al (2007b) Tableseer: automatic table metadata extraction and searching in digital libraries. In: Proceedings of the 7th ACM/IEEE joint conference on digital libraries. ACM, Vancouver, Canada, pp 91–100
Maity SK, Panigrahi A, Ghosh S, et al (2019) Deeptagrec: A content-cum-user based tag recommendation framework for stack overflow. In: Proceedings of the 41st European conference on information retrieval, Springer, Cologne, Germany, pp 125–131
Mitra B, Diaz F, Craswell N (2017) Learning to match using local and distributed representations of text for web search. In: Proceedings of the 26th international conference on world wide web. ACM, Perth, Australia, pp 1291–1299
Nallapati R, Zhou B, dos Santos CN, et al (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of the 20th conference on computational natural language learning. ACL, Berlin, Germany, pp 280–290
Nogueira R, Cho K (2019) Passage re-ranking with BERT. CoRR abs/1901.04085
Pang L, Lan Y, Guo J, et al (2016) Text matching as image recognition. In: Proceedings of the 30th AAAI conference on artificial intelligence, Phoenix, USA, pp 2793–2799
Pimplikar R, Sarawagi S (2012) Answering table queries on the web using column keywords. Proc VLDB Endow 5(10):908–919
Pyreddy P, Croft WB (1997) TINTIN: A system for retrieval in text tables. In: Proceedings of the 2nd ACM international conference on digital libraries. ACM, Philadelphia, USA, pp 193–200
Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing. Association for Computational Linguistics, Hong Kong, China, pp 3980–3990
Robertson SE, Zaragoza H (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr 3(4):333–389
Rush AM, Chopra S, Weston J (2015) A neural attention model for abstractive sentence summarization. In: Proceedings of the 2015 conference on empirical methods in natural language processing. The Association for Computational Linguistics, Lisbon, Portugal, pp 379–389
Salton G, Yang CS (1973) On the specification of term values in automatic indexing. Journal of Documentation
Sanh V, Debut L, Chaumond J, et al (2019) Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. http://arxiv.org/abs/1910.01108
dos Santos CN, Barbosa L, Bogdanova D, et al (2015) Learning hybrid representations to retrieve semantically equivalent questions. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing. The Association for Computer Linguistics, Beijing, China, pp 694–699
Santosh TYSS, Saha A, Ganguly N (2020) MVL: multi-view learning for news recommendation. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 1873–1876
Shraga R, Roitman H, Feigenblat G, et al (2020a) Ad hoc table retrieval using intrinsic and extrinsic similarities. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 2479–2485
Shraga R, Roitman H, Feigenblat G, et al (2020b) Web table retrieval using multimodal deep learning. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. ACM, Virtual Event, China, pp 1399–1408
Shraga R, Roitman H, Feigenblat G, et al (2020c) Projection-based relevance model for table retrieval. In: Proceedings of the web conference 2020. ACM / IW3C2, Taipei, Taiwan, pp 28–29
Sun C, Qiu X, Xu Y, et al (2019a) How to fine-tune BERT for text classification? In: Proceedings of the 19th China national conference on Chinese computational linguistics, Springer, Hainan, China, pp 194–206
Sun H, Ma H, He X, et al (2016) Table cell search for question answering. In: Proceedings of the 25th international conference on world wide web. ACM, Montreal, Canada, pp 771–782
Sun Y, Yan Z, Tang D et al (2019) Content-based table retrieval for web queries. Neurocomputing 349:183–189
Thakur N, Reimers N, Daxenberger J, et al (2021) Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Virtual Event, pp 296–310
Trabelsi M, Davison BD, Heflin J (2019) Improved table retrieval using multiple context embeddings for attributes. In: Proceedings of the 2019 IEEE international conference on big data. IEEE, Los Angeles, USA, pp 1238–1244
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Proceedings of the 30th international conference on neural information processing systems, Long Beach, USA, pp 5998–6008
Venetis P, Halevy AY, Madhavan J et al (2011) Recovering semantics of tables on the web. Proc VLDB Endow 4(9):528–538
Wan S, Lan Y, Guo J, et al (2016) A deep architecture for semantic matching with multiple positional sentence representations. In: Proceedings of the 30th AAAI conference on artificial intelligence, Phoenix, USA, pp 2835–2841
Xiong C, Dai Z, Callan J, et al (2017a) End-to-end neural ad-hoc ranking with kernel pooling. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval. ACM, Shinjuku, Japan, pp 55–64
Xiong C, Zhong V, Socher R (2017b) Dynamic Coattention networks for question answering. In: Proceedings of the 5th international conference on learning representations, Toulon, France
Zhang L, Zhang S, Balog K (2019) Table2vec: Neural word and entity embeddings for table population and retrieval. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. ACM, Paris, France, pp 1029–1032
Zhang R, Guo J, Fan Y, et al (2018) Question headline generation for news articles. In: Proceedings of the 27th ACM international conference on information and knowledge management. ACM, Torino, Italy, pp 617–626
Zhang S, Balog K (2018) Ad hoc table retrieval using semantic similarity. In: Proceedings of the web conference 2018. ACM, Lyon, France, pp 1553–1562
Zhang S, Balog K (2020) Web table extraction, retrieval, and augmentation: a survey. ACM Trans Intell Syst Technol 11(2):1–35
Zhu M, Ahuja A, Wei W, et al (2019) A hierarchical attention retrieval model for healthcare question answering. In: Proceedings of the web conference 2019. ACM, San Francisco, USA, pp 2472–2482
Acknowledgements
Not Applicable.
Funding
This work is supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) from Brazil.
Author information
Contributions
These authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare they have no conflict of interest.
Ethical Approval and Consent to participate
Not Applicable.
Consent for publication
The authors agree to the publication of this study.
About this article
Cite this article
Silva, L., Barbosa, L. Matching news articles and wikipedia tables for news augmentation. Knowl Inf Syst 65, 1713–1734 (2023). https://doi.org/10.1007/s10115-022-01815-0
DOI: https://doi.org/10.1007/s10115-022-01815-0