
Tough Tables: Carefully Evaluating Entity Linking for Tabular Data

Conference paper. In: The Semantic Web – ISWC 2020 (ISWC 2020)

Abstract

Table annotation is a key task to improve querying the Web and to support Knowledge Graph population from legacy sources (tables). Last year, the SemTab challenge was introduced to unify different efforts to evaluate table annotation algorithms by providing a common interface and several general-purpose datasets as ground truth. The SemTab datasets are useful for gaining a general understanding of how these algorithms work, and the organizers of the challenge added some artificial noise to the data to make the annotation trickier. However, it is hard to analyze specific aspects in an automatic way. For example, the ambiguity of names at the entity level can largely affect the quality of the annotation. In this paper, we propose a novel dataset to complement the datasets proposed by SemTab. The dataset consists of a set of high-quality, manually curated tables with non-obviously linkable cells, i.e., cells whose values are ambiguous names, typos, and misspelled entity names that do not appear in the current version of the SemTab dataset. These challenges are particularly relevant for the ingestion of structured legacy sources into existing knowledge graphs. Evaluations run on this dataset show that ambiguity is a key problem for entity linking algorithms and suggest a promising direction for future work in the field.


Notes

  1.

    T2Dv2: http://webdatacommons.org/webtables/goldstandardV2.html.

  2.

    See Tables 53822652_0_5767892317858575530 and 12th_Goya_Awards#1 from Round 1 and Round 2, respectively. These errors come from the T2D and W2D datasets used in SemTab 2019.

  3.

    We checked our annotations against a private local replica of the DBpedia SPARQL endpoint, loading the 2016-10 datasets listed at https://wiki.dbpedia.org/public-sparql-endpoint.

  4.

    We used the online version at http://dbpedia.org/sparql.

  5.

    This strategy, which might look naive, is the same one implemented in OpenRefine, where the first 10 rows are used to suggest possible types for the current column.

  6.

    The SemTab 2019 challenge provided the target file with the full list of cells to annotate, disregarding novel facts.

  7.

    We point out that some homonyms are easy to resolve using DBpedia (e.g., US cities are easy to find, since appending the state to a city's canonical name points directly to the right entity; for example, the city of Cambridge in Illinois is dbr:Cambridge,_Illinois in DBpedia).

  8.

    Note that it is possible to address this problem with a mapping dictionary, if one is available, but this is not a desirable solution: it does not make the algorithm any smarter; the same holds for looking entities up on Google Search.

  9.

    https://github.com/sem-tab-challenge/aicrowd-evaluator.

  10.

    We used the WikipediaSearch online service available at https://en.wikipedia.org/w/api.php, while we recreated the DBLookup online instance on a dedicated virtual machine.

  11.

    A fork of the original code repository is available at https://bitbucket.org/vcutrona/mantistable-tool.py.

  12.

    The standard format introduced in SemTab 2019 is directly derived from the T2Dv2 format; thus, the number of algorithms that can be tested is potentially larger.

  13.

    https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2019/results.html.

  14.

    https://www.nature.com/articles/sdata201618.

  15.

    https://creativecommons.org/licenses/by/4.0/.

  16.

    https://github.com/vcutrona/tough-tables.

  17.

    The SemTab 2020 challenge is still in progress and provides tables without a known ground truth. For this reason, we will publish the full 2T dataset, including the ground truth files, at the end of SemTab 2020.

  18.

    The instance types file at http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2 contains the most specific type of each entity.

  19.

    http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.nt.


Acknowledgments

We would like to thank the authors of MantisTable for sharing the prototype source code. This work was partially supported by the Google Cloud Platform Education Grant. EJR was supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway). FB is a member of the Bocconi Institute for Data Science and Analytics (BIDSA) and the Data and Marketing Insights (DMI) unit.

Author information

Correspondence to Federico Bianchi.


Appendix A: 2T Ground Truth Generation Details

A.1 CEA Table Generation and Preprocessing

2T has been built using real tables. Here we clarify that by "real table" we mean a table, possibly artificially built, that resembles a real one. Examples are "list of companies with their market segment" or "list of Italian merged political parties", which look like queries that a manager or a journalist could run against a database. The main reasons behind this choice are: (i) it is difficult to get access to real databases; (ii) open data portals make many tables available, but these tables are almost always in an aggregated form that makes it difficult to annotate them with entities from a general KG like DBpedia. When the data are fine-grained enough, almost all the mentioned entities are not available in the reference KG. For example, in the list of bank failures obtained from the U.S. Open Data Portal, only 27 out of 561 failed banks are represented in DBpedia.

In this section, we describe the processes we adopted to collect real tables, or to build tables that resemble real ones.

DBpedia Tables. We used the DBpedia SPARQL endpoint as a table generator (SPARQL results are tables). We ran queries to generate tables that include the following kinds of columns (a sketch of such a query follows the list):

  • entity columns: columns with DBpedia URIs that represent entities.

  • label columns: columns with possible mentions of the corresponding entities in the entity column. Given an entity column, the corresponding label column has been created by randomly choosing among the rdfs:label, foaf:name, and dbo:alias properties.

  • literal columns: other columns, with additional information.
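
As an illustration, the following is a minimal sketch of such a generation query, run with the SPARQLWrapper library against the public DBpedia endpoint. The chosen class (dbo:Company), the literal property (dbo:revenue), and the fixed label property are illustrative assumptions, not the actual 2T generation queries.

```python
# Sketch: generate an (entity, label, literal) table from the DBpedia SPARQL
# endpoint. The query is illustrative; in 2T the label property would be
# sampled at random among rdfs:label, foaf:name, and dbo:alias.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?company ?label ?revenue WHERE {
        ?company a dbo:Company ;
                 rdfs:label ?label ;
                 dbo:revenue ?revenue .
        FILTER (lang(?label) = "en")
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding becomes one table row: entity column, label column, literal column.
rows = [
    (b["company"]["value"], b["label"]["value"], b["revenue"]["value"])
    for b in results["results"]["bindings"]
]
```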

Wikipedia Tables. We browsed Wikipedia looking for pages containing tables of interest (e.g., list of presidents, list of companies, list of singers). We generated different versions of the collected Wikipedia tables by applying different cleaning steps. The following steps have been applied to Wikipedia tables in the TOUGH_MISC category:

  • Merged cells have been split into multiple cells with the same value.

  • Multi-value cells (slash-separated values, e.g., Pop / Rock, or multi-line values, e.g., Barbados <br> United States, or in-line lists, e.g., <ul>, <li>) have been exploded into several rows. If two or more multi-value cells are on the same row, we exploded all of them (Cartesian product of all the values). If a cell contains the same information in multiple languages (e.g., anthem song titles), we exploded the cell into two or more columns (creating new rows would essentially produce duplicates). A sketch of this explosion is given after this list.
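
The explosion of multi-value cells can be pictured with the following sketch, which assumes the table is loaded into a pandas DataFrame and that multi-value cells have already been split into Python lists; the column names and values are illustrative, not taken from the 2T tables.

```python
# Sketch of the multi-value cell explosion (Cartesian product of values).
import pandas as pd

df = pd.DataFrame({
    "artist": ["Queen"],
    "genre": [["Pop", "Rock"]],                   # slash-separated values, already split
    "country": [["Barbados", "United States"]],   # multi-line values, already split
})

# Exploding each multi-value column in turn yields the Cartesian product
# of all the values on the same row.
exploded = df.explode("genre").explode("country").reset_index(drop=True)
print(exploded)  # 1 row becomes 2 x 2 = 4 rows
```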

Wikipedia tables in the CTRL_WIKI group underwent the following additional cleaning steps:

  • “Note”, “Description”, and similar long-text columns have been removed.

  • Cells with “None”, “null”, “N/A”, “Unaffiliated”, and similar values have been emptied.

  • Columns with only images (e.g., List of US presidents) have been removed.

  • All HTML tags have been deleted from cells (e.g., country flag icons).

  • Notes, footnotes, and any other additional within-cell information (e.g., birthYear and deathYear for U.S. presidents) have been removed.

Most of the table values are already hyperlinked to their Wikipedia pages. We used these hyperlinks as the correct annotations (we trust Wikipedia as a correct source of information), following the conversion procedure described below.

Finally, we converted the Wikipedia links to their corresponding DBpedia links by replacing https://en.wikipedia.org/wiki/ with http://dbpedia.org/resource/ in the decoded URL, e.g., https://en.wikipedia.org/wiki/McDonald%27s → dbr:McDonald's, if available; otherwise, we manually looked for the right DBpedia link (e.g., https://en.wikipedia.org/wiki/1788-89_United_States_presidential_election → dbr:United_States_presidential_election,_1788-89). If this attempt also failed, we left the cell blank (no annotation available in DBpedia).
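
A minimal sketch of this link conversion is shown below; the helper name wiki_to_dbpedia is hypothetical, and the existence check against DBpedia (and the manual fallback described above) is omitted.

```python
# Sketch of the Wikipedia-to-DBpedia link conversion: decode the Wikipedia URL
# and swap the prefix. `wiki_to_dbpedia` is a hypothetical helper, not part of
# the 2T code base.
from urllib.parse import unquote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"
DBR_PREFIX = "http://dbpedia.org/resource/"

def wiki_to_dbpedia(wiki_url: str) -> str:
    decoded = unquote(wiki_url)  # e.g., McDonald%27s -> McDonald's
    return decoded.replace(WIKI_PREFIX, DBR_PREFIX)

print(wiki_to_dbpedia("https://en.wikipedia.org/wiki/McDonald%27s"))
# http://dbpedia.org/resource/McDonald's
```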

A.2 CTA Ground Truth Construction

Automatic CTA Annotations from CEA. The 2T dataset focuses mainly on entities because, in our opinion, the CEA task is the core task: with good performance on CEA, it is possible to approximate the CTA task easily. We exploited this observation to automatically construct the CTA annotations starting from the CEA ones, which we trust. For each annotated column, we collected all the annotated entities from the CEA dataset and retrieved the most specific type of each entity from the DBpedia 2016-10 dump (Footnote 18). We then annotated the column with the most specific supertype, i.e., the lowest common ancestor of all the types in the DBpedia 2016-10 ontology (Footnote 19).
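
The following sketch illustrates how such a lowest common ancestor can be computed over the rdfs:subClassOf hierarchy. It assumes the ontology is loaded locally with rdflib and that the most specific types of the column's entities are already available; it is an illustration of the procedure, not the actual 2T code.

```python
# Sketch: annotate a column with the lowest common ancestor (LCA) of the most
# specific types of its CEA-annotated entities. Assumptions: a local copy of
# the DBpedia 2016-10 ontology in N-Triples format, loaded with rdflib.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

ontology = Graph()
ontology.parse("dbpedia_2016-10.nt", format="nt")  # hypothetical local path

def ancestors(cls: URIRef) -> set:
    """Return cls plus all of its rdfs:subClassOf ancestors."""
    seen, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(ontology.objects(current, RDFS.subClassOf))
    return seen

def lowest_common_ancestor(types: list) -> URIRef:
    """Pick the deepest class shared by every type in the column."""
    common = set.intersection(*(ancestors(t) for t in types))
    # The deepest common class is the one with the most ancestors of its own.
    return max(common, key=lambda c: len(ancestors(c)))

column_types = [
    URIRef("http://dbpedia.org/ontology/City"),
    URIRef("http://dbpedia.org/ontology/Village"),
]
print(lowest_common_ancestor(column_types))  # e.g., dbo:Settlement
```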

Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. In: Pan, J.Z., et al. (eds.) The Semantic Web – ISWC 2020. Lecture Notes in Computer Science, vol. 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_21

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8
