
Tough Tables: Carefully Evaluating Entity Linking for Tabular Data

Conference paper. In: The Semantic Web – ISWC 2020 (ISWC 2020)

Abstract

Table annotation is a key task to improve querying the Web and to support Knowledge Graph population from legacy sources (tables). Last year, the SemTab challenge was introduced to unify different efforts to evaluate table annotation algorithms by providing a common interface and several general-purpose datasets as ground truth. The SemTab datasets are useful for gaining a general understanding of how these algorithms work, and the organizers of the challenge added some artificial noise to the data to make the annotation trickier. However, it is hard to analyze specific aspects in an automatic way. For example, the ambiguity of names at the entity level can largely affect the quality of the annotation. In this paper, we propose a novel dataset to complement the datasets proposed by SemTab. The dataset consists of a set of high-quality, manually curated tables with non-obviously linkable cells, i.e., cells whose values are ambiguous names, typos, and misspelled entity names that do not appear in the current version of the SemTab dataset. These challenges are particularly relevant for the ingestion of structured legacy sources into existing knowledge graphs. Evaluations run on this dataset show that ambiguity is a key problem for entity linking algorithms and suggest a promising direction for future work in the field.


Notes

  1.

    T2Dv2: http://webdatacommons.org/webtables/goldstandardV2.html.

  2.

    See Tables 53822652_0_5767892317858575530 and 12th_Goya_Awards#1 from Round 1 and Round 2, respectively. These errors come from the T2D and W2D datasets used in SemTab 2019.

  3.

    We checked our annotations against a private local replica of the DBpedia SPARQL endpoint, loading the 2016-10 datasets listed at https://wiki.dbpedia.org/public-sparql-endpoint.

  4.

    We used the online version at http://dbpedia.org/sparql.

  5.

    This strategy, which might look naive, is the same one implemented in OpenRefine, where the first 10 rows are used to suggest possible types for the current column.

  6.

    The SemTab 2019 challenge provided the target file with the full list of cells to annotate, disregarding novel facts.

  7.

    We point out that some homonyms are easy to resolve using DBpedia (e.g., US cities are easy to find, since appending the state to a city's canonical name points directly to the right entity; for example, the city of Cambridge in Illinois is dbr:Cambridge,_Illinois in DBpedia).

  8.

    Note that it is possible to address this problem with a mapping dictionary, if one is available, but this is not a desirable solution: it does not make the algorithm any smarter; the same holds for looking entities up on Google Search.

  9.

    https://github.com/sem-tab-challenge/aicrowd-evaluator.

  10.

    We used the WikipediaSearch online service available at https://en.wikipedia.org/w/api.php, while we recreated the DBLookup online instance on a dedicated virtual machine.

  11.

    A fork of the original code repository is available at https://bitbucket.org/vcutrona/mantistable-tool.py.

  12.

    The standard format introduced in SemTab 2019 is directly derived from the T2Dv2 format; thus, the number of algorithms that can be tested is potentially larger.

  13.

    https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2019/results.html.

  14.

    https://www.nature.com/articles/sdata201618.

  15.

    https://creativecommons.org/licenses/by/4.0/.

  16.

    https://github.com/vcutrona/tough-tables.

  17.

    The SemTab 2020 challenge is still in progress and provides tables without a known ground truth. For this reason, we will publish the full 2T dataset, including the ground truth files, at the end of SemTab 2020.

  18.

    The instance types file at http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2 contains the most specific type of each entity.

  19.

    http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.nt.


Acknowledgments

We would like to thank the authors of MantisTable for sharing the prototype source code. This work was partially supported by the Google Cloud Platform Education Grant. EJR was supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway). FB is a member of the Bocconi Institute for Data Science and Analytics (BIDSA) and the Data and Marketing Insights (DMI) unit.

Author information

Correspondence to Federico Bianchi.


Appendix A: 2T Ground Truth Generation Details

A.1 CEA Table Generation and Preprocessing

2T has been built using real tables. Here we clarify that by "real table" we mean a table, possibly artificially built, that resembles a real one. Examples are "list of companies with their market segment" or "list of Italian merged political parties", which look like queries that a manager or a journalist could run against a database. The main reasons behind this choice are: (i) it is difficult to get access to real databases; (ii) open data portals make many tables available, but these tables are almost always in an aggregated form that makes it difficult to annotate them with entities from a general KG like DBpedia. When the data are fine-grained enough, almost all the mentioned entities are not available in the reference KG. For example, in the list of bank failures obtained from the U.S. Open Data Portal, only 27 out of 561 failed banks are represented in DBpedia.

In this section, we describe the processes we adopted to collect real tables, or to build tables that resemble real ones.

DBpedia Tables. We used the DBpedia SPARQL endpoint as a table generator (SPARQL results are tables). We ran queries to generate tables that include the following kinds of columns (a sketch of such a query follows the list):

  • entity columns: columns with DBpedia URIs that represent entities.

  • label columns: columns with possible mentions of the corresponding entities in the entity column. Given an entity column, the corresponding label column has been created by randomly choosing among the rdfs:label, foaf:name, and dbo:alias properties.

  • literal columns: other columns, with additional information.
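
As an illustration, the following is a minimal sketch of such a generation query, run with the SPARQLWrapper library against the public DBpedia endpoint. The chosen class (dbo:Company), the literal property (dbo:revenue), and the fixed label property are illustrative assumptions, not the actual 2T generation queries.

```python
# Sketch: generate an (entity, label, literal) table from the DBpedia SPARQL
# endpoint. The query is illustrative; in 2T the label property would be
# sampled at random among rdfs:label, foaf:name, and dbo:alias.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?company ?label ?revenue WHERE {
        ?company a dbo:Company ;
                 rdfs:label ?label ;
                 dbo:revenue ?revenue .
        FILTER (lang(?label) = "en")
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding becomes one table row: entity column, label column, literal column.
rows = [
    (b["company"]["value"], b["label"]["value"], b["revenue"]["value"])
    for b in results["results"]["bindings"]
]
```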

Wikipedia Tables. We browsed Wikipedia looking for pages containing tables of interest (e.g., list of presidents, list of companies, list of singers). We generated different versions of the collected Wikipedia tables by applying different cleaning steps. The following steps have been applied to Wikipedia tables in the TOUGH_MISC category:

  • Merged cells have been split into multiple cells with the same value.

  • Multi-value cells (slash-separated values, e.g., Pop / Rock, or multi-line values, e.g., Barbados <br> United States, or in-line lists, e.g., <ul>, <li>) have been exploded into several rows. If two or more multi-value cells are on the same row, we exploded all of them (Cartesian product of all the values). If a cell contains the same information in multiple languages (e.g., anthem song titles), we exploded the cell into two or more columns (creating new rows would essentially produce duplicates). A sketch of this explosion is given after this list.
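
The explosion of multi-value cells can be pictured with the following sketch, which assumes the table is loaded into a pandas DataFrame and that multi-value cells have already been split into Python lists; the column names and values are illustrative, not taken from the 2T tables.

```python
# Sketch of the multi-value cell explosion (Cartesian product of values).
import pandas as pd

df = pd.DataFrame({
    "artist": ["Queen"],
    "genre": [["Pop", "Rock"]],                   # slash-separated values, already split
    "country": [["Barbados", "United States"]],   # multi-line values, already split
})

# Exploding each multi-value column in turn yields the Cartesian product
# of all the values on the same row.
exploded = df.explode("genre").explode("country").reset_index(drop=True)
print(exploded)  # 1 row becomes 2 x 2 = 4 rows
```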

Wikipedia tables in the CTRL_WIKI group underwent the following additional cleaning steps:

  • “Note”, “Description”, and similar long-text columns have been removed.

  • Cells with “None”, “null”, “N/A”, “Unaffiliated”, and similar values have been emptied.

  • Columns with only images (e.g., List of US presidents) have been removed.

  • All HTML tags have been deleted from cells (e.g., country flag icons).

  • Notes, footnotes, and any other additional within-cell information (e.g., birthYear and deathYear for U.S. presidents) have been removed.

Most of the table values are already hyperlinked to their Wikipedia pages. We used these hyperlinks as the correct annotations (we trust Wikipedia as a correct source of information), following the conversion procedure described below.

Finally, we converted the Wikipedia links to their corresponding DBpedia links by replacing https://en.wikipedia.org/wiki/ with http://dbpedia.org/resource/ in the decoded URL, e.g., https://en.wikipedia.org/wiki/McDonald%27s → dbr:McDonald's, if available; otherwise, we manually looked for the right DBpedia link (e.g., https://en.wikipedia.org/wiki/1788-89_United_States_presidential_election → dbr:United_States_presidential_election,_1788-89). If this attempt also failed, we left the cell blank (no annotation available in DBpedia).
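
A minimal sketch of this link conversion is shown below; the helper name wiki_to_dbpedia is hypothetical, and the existence check against DBpedia (and the manual fallback described above) is omitted.

```python
# Sketch of the Wikipedia-to-DBpedia link conversion: decode the Wikipedia URL
# and swap the prefix. `wiki_to_dbpedia` is a hypothetical helper, not part of
# the 2T code base.
from urllib.parse import unquote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"
DBR_PREFIX = "http://dbpedia.org/resource/"

def wiki_to_dbpedia(wiki_url: str) -> str:
    decoded = unquote(wiki_url)  # e.g., McDonald%27s -> McDonald's
    return decoded.replace(WIKI_PREFIX, DBR_PREFIX)

print(wiki_to_dbpedia("https://en.wikipedia.org/wiki/McDonald%27s"))
# http://dbpedia.org/resource/McDonald's
```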

A.2 CTA Ground Truth Construction

Automatic CTA Annotations from CEA. The 2T dataset focuses mainly on entities because, in our opinion, the CEA task is the core task: with good performance on CEA, it is possible to approximate the CTA task easily. We exploited this observation to automatically construct the CTA annotations starting from the CEA ones, which we trust. For each annotated column, we collected all the annotated entities from the CEA dataset and retrieved the most specific type of each entity from the DBpedia 2016-10 dump (Footnote 18). We then annotated the column with the most specific supertype, i.e., the lowest common ancestor of all the types in the DBpedia 2016-10 ontology (Footnote 19).
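
The following sketch illustrates how such a lowest common ancestor can be computed over the rdfs:subClassOf hierarchy. It assumes the ontology is loaded locally with rdflib and that the most specific types of the column's entities are already available; it is an illustration of the procedure, not the actual 2T code.

```python
# Sketch: annotate a column with the lowest common ancestor (LCA) of the most
# specific types of its CEA-annotated entities. Assumptions: a local copy of
# the DBpedia 2016-10 ontology in N-Triples format, loaded with rdflib.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

ontology = Graph()
ontology.parse("dbpedia_2016-10.nt", format="nt")  # hypothetical local path

def ancestors(cls: URIRef) -> set:
    """Return cls plus all of its rdfs:subClassOf ancestors."""
    seen, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(ontology.objects(current, RDFS.subClassOf))
    return seen

def lowest_common_ancestor(types: list) -> URIRef:
    """Pick the deepest class shared by every type in the column."""
    common = set.intersection(*(ancestors(t) for t in types))
    # The deepest common class is the one with the most ancestors of its own.
    return max(common, key=lambda c: len(ancestors(c)))

column_types = [
    URIRef("http://dbpedia.org/ontology/City"),
    URIRef("http://dbpedia.org/ontology/Village"),
]
print(lowest_common_ancestor(column_types))  # e.g., dbo:Settlement
```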

Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. In: Pan, J.Z., et al. (eds.) The Semantic Web – ISWC 2020. Lecture Notes in Computer Science, vol. 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_21

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8
