research-article

FacetE: exploiting web tables for domain-specific word embedding evaluation

Authors:

Michael Günther,

Wolfgang Lehner

DBTest '20: Proceedings of the workshop on Testing Database Systems

Article No.: 5, Pages 1 - 6

https://doi.org/10.1145/3395032.3395325

Published: 19 June 2020

Abstract

Today's natural language processing and information retrieval systems depend heavily on word embedding techniques to represent text values. However, choosing a word embedding dataset for a specific task is not trivial. Current word embedding evaluation methods mostly provide only a one-dimensional quality measure, which does not express how knowledge from different domains is represented in the word embedding models. To overcome this limitation, we provide a new evaluation dataset called FacetE, derived from 125M Web tables, which enables domain-sensitive evaluation. We show that FacetE can be used effectively to evaluate word embedding models. The evaluation of common general-purpose word embedding models suggests that there is currently no single word embedding that is best for every domain.
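The contrast the abstract draws between a one-dimensional quality measure and a domain-sensitive one can be illustrated with a small sketch. This is not FacetE's actual procedure: the toy vectors, the domain names, the test items, and the "closer to the correct label than to a distractor" scoring rule are all assumptions made up for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 2-d embeddings; a real model would have hundreds of dimensions.
EMB = {
    "paris": (0.9, 0.1),
    "berlin": (0.85, 0.15),
    "city": (0.8, 0.2),
    "aspirin": (0.7, 0.3),
    "ibuprofen": (0.3, 0.7),
    "drug": (0.2, 0.8),
}

# Hypothetical test items: (domain, word, correct label, distractor label).
ITEMS = [
    ("geography", "paris", "city", "drug"),
    ("geography", "berlin", "city", "drug"),
    ("medicine", "aspirin", "drug", "city"),
    ("medicine", "ibuprofen", "drug", "city"),
]

def evaluate(items, emb):
    """Return (accuracy per domain, single aggregate accuracy)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, word, correct, distractor in items:
        totals[domain] += 1
        # An item counts as a hit if the word is closer to its correct label.
        if cosine(emb[word], emb[correct]) > cosine(emb[word], emb[distractor]):
            hits[domain] += 1
    per_domain = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_domain, overall

per_domain, overall = evaluate(ITEMS, EMB)
print(per_domain)
print(overall)
```

With these toy values the aggregate score hides the fact that the hypothetical model does worse on one domain than on the other, which is the kind of difference a per-domain breakdown makes visible.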

Cited By

S. Giancani, R. Albertoni, and C. Catalano. 2023. Quality of word and concept embeddings in targetted biomedical domains. Heliyon 9, 6 (2023), e16818. Online publication date: Jun-2023.
https://doi.org/10.1016/j.heliyon.2023.e16818

Index Terms

Computing methodologies → Artificial intelligence

Published In

DBTest '20: Proceedings of the workshop on Testing Database Systems

June 2020

42 pages

ISBN: 9781450380010

DOI: 10.1145/3395032

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 19, 2020

Portland, Oregon

Acceptance Rates

DBTest '20 Paper Acceptance Rate 7 of 10 submissions, 70%;

Overall Acceptance Rate 31 of 56 submissions, 55%
