DOI: 10.1145/3395032.3395325
research-article

FacetE: exploiting web tables for domain-specific word embedding evaluation

Published: 19 June 2020

Abstract

Today's natural language processing and information retrieval systems depend heavily on word embedding techniques to represent text values. However, choosing a word embedding dataset for a specific task is not trivial. Current word embedding evaluation methods mostly provide only a one-dimensional quality measure, which does not express how well knowledge from different domains is represented in the word embedding models. To overcome this limitation, we provide a new evaluation dataset called FacetE, derived from 125M Web tables, that enables domain-sensitive evaluation. We show that FacetE can be used effectively to evaluate word embedding models. The evaluation of common general-purpose word embedding models suggests that there is currently no single word embedding that is best for every domain.


    Published In

    DBTest '20: Proceedings of the workshop on Testing Database Systems
    June 2020
    42 pages
    ISBN:9781450380010
    DOI:10.1145/3395032
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    DBTest '20 Paper Acceptance Rate 7 of 10 submissions, 70%;
    Overall Acceptance Rate 31 of 56 submissions, 55%
