SocialLink: exploiting graph embeddings to link DBpedia entities to Twitter profiles

Regular Paper · Progress in Artificial Intelligence

Abstract

SocialLink is a project designed to match social media profiles on Twitter to the corresponding entities in DBpedia. Built to bridge the vibrant Twitter social media world and the Linked Open Data cloud, SocialLink enables knowledge transfer between the two, both assisting Semantic Web practitioners in harvesting the vast amounts of information available on Twitter and allowing DBpedia data to be leveraged for social media analysis tasks. In this paper, we further extend the original SocialLink approach by exploiting graph-based features drawn from both DBpedia and Twitter, represented as graph embeddings learned from vast amounts of unlabeled data. Introducing these new features required us to redesign our deep neural network-based candidate selection algorithm; as a result, we experimentally demonstrate a significant improvement in the performance of SocialLink.


Notes

  1. http://sociallink.futuro.media

  2. https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego

  3. https://github.com/Remper/sociallink

  4. https://zenodo.org/record/820160

  5. We start from KB entries because they are entirely known in advance, unlike social network profiles, which can only be queried or (partially) acquired via expensive crawling.

  6. English DBpedia version 2016-04, as concerns the experiments reported in this paper (to enable comparison with the original approach in [25]). The SocialLink LOD dataset released online is instead built using data from all language chapters of the most recent DBpedia.

  7. Entity alive status is gathered from temporal properties like dbo:deathDate, dbo:deathYear, dbo:closingYear, dbo:closed, dbo:extinctionYear, dbo:extinctionDate, wikidata:P570, wikidata:P20, wikidata:P509, or properties implying death like dbo:deathPlace, dbo:deathCause, dbo:causeOfDeath.

  8. Gold alignments derive from selected foaf:isPrimaryTopicOf and wikidata:P2002 triples of entities assumed living.

  9. http://flink.apache.org/

  10. See [1, Section 3.2] for a detailed description of how LSA embeddings are computed.

  11. PageRank Split embeddings downloaded from http://data.dws.informatik.uni-mannheim.de/rdf2vec/models/DBpedia/2016-04/GlobalVectors/

  12. The regression models described in this section perform an approximate matrix factorization rather than the exact one used, for example, by LSA.

  13. We compare the performance on the subset with that on its complement using the non-paired approximate randomization test.

  14. http://datahub.io/dataset/sociallink

  15. http://github.com/Remper/sociallink

  16. http://sociallink.futuro.media/sparql

  17. http://nlp.stanford.edu/data/glove.42B.300d.zip

  18. https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

References

  1. Aprosio, A.P., Giuliano, C., Lavelli, A.: Automatic expansion of DBpedia exploiting Wikipedia cross-language information. In: Proceedings of the Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26–30, 2013. Lecture Notes in Computer Science, vol. 7882, pp. 397–411. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-38288-8_27

  2. Besel, C., Schlötterer, J., Granitzer, M.: Inferring semantic interest profiles from Twitter followees: Does Twitter know better than your friends? In: ACM SAC, pp. 1152–1157 (2016)

  3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  4. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Biased graph walks for RDF graph embeddings. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, pp. 21:1–21:12 (2017)

  5. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global RDF vector space embeddings. In: The Semantic Web-16th International Semantic Web Conference ISWC 2017, Vienna, Austria, October 21-25, 2017, Proceedings, Part I, Lecture Notes in Computer Science, vol. 10587, pp. 190–207. Springer (2017). https://doi.org/10.1007/978-3-319-68288-4_12

  6. Corcoglioniti, F., Giuliano, C., Nechaev, Y., Zanoli, R.: Pokedem: An automatic social media management application. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys ’17, pp. 358–359. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3109859.3109980

  7. Corcoglioniti, F., Palmero Aprosio, A., Nechaev, Y., Giuliano, C.: MicroNeel: Combining NLP tools to perform named entity detection and linking on microposts. In: EVALITA (2016)

  8. Corcoglioniti, F., Rospocher, M., Mostarda, M., Amadori, M.: Processing billions of RDF triples on a single machine using streaming and sorting. In: ACM SAC, pp. 368–375 (2015)

  9. Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent semantic kernels. J. Intell. Inf. Syst. 18(2), 127–152 (2002). https://doi.org/10.1023/A:1013625426931

  10. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing Wikidata to the linked data web. In: Proceedings of the 13th International Semantic Web Conference-Part I, ISWC ’14, pp. 50–65. Springer, New York, NY, USA (2014). https://doi.org/10.1007/978-3-319-11964-9_4

  11. Faralli, S., Stilo, G., Velardi, P.: Large scale homophily analysis in Twitter using a Twixonomy. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, pp. 2334–2340 (2015)

  12. Fetahu, B., Anand, A., Anand, A.: How much is Wikipedia lagging behind news? In: Proceedings of the ACM Web Science Conference, WebSci ’15, pp. 28:1–28:9. ACM, New York, NY, USA (2015). https://doi.org/10.1145/2786451.2786460

  13. Goga, O.: Matching user accounts across online social networks: methods and applications. Ph.D. thesis, LIP6-Laboratoire d’Informatique de Paris 6 (2014)

  14. Goga, O., Lei, H., Parthasarathi, S.H.K., Friedland, G., Sommer, R., Teixeira, R.: Exploiting innocuous activity for correlating users across sites. In: Proceedings of the WWW, pp. 447–458. ACM (2013)

  15. Goga, O., Loiseau, P., Sommer, R., Teixeira, R., Gummadi, K.P.: On the reliability of profile matching across large online social networks. In: Proceedings of KDD, pp. 1799–1808. ACM (2015)

  16. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: a survey (2017). arXiv:1705.02801

  17. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: The 22th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 855–864. ACM (2016)

  18. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194, 28–61 (2013). https://doi.org/10.1016/j.artint.2012.06.001

  19. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)

  20. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015). https://doi.org/10.3233/SW-140134

  21. Liu, S., Wang, S., Zhu, F., Zhang, J., Krishnan, R.: HYDRA: Large-scale social identity linkage via heterogeneous behavior modeling. In: Proceedings of SIGMOD, pp. 51–62. ACM (2014)

  22. Lu, C.T., Shuai, H.H., Yu, P.S.: Identifying your customers in social networks. In: Proceedings of CIKM, pp. 391–400. ACM (2014)

  23. Minard, A., Qwaider, M.R.H., Magnini, B.: FBK-NLP at NEEL-IT: active learning for domain adaptation. In: EVALITA (2016)

  24. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: Concealing interests of passive users in social media. In: Proceedings of the Re-coding Black Mirror 2017 Workshop co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, 22 Oct 2017 (2017)

  25. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: Linking knowledge bases to social media profiles. In: ACM SAC, pp. 145–150 (2017)

  26. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: SocialLink: Linking DBpedia entities to corresponding Twitter accounts. In: The Semantic Web-ISWC 2017, pp. 165–174. Springer, Berlin (2017). https://doi.org/10.1007/978-3-319-68204-4_17

  27. Noreen, E.W.: Computer-Intensive Methods for Testing Hypotheses. Wiley, New York (1989)

  28. Peled, O., Fire, M., Rokach, L., Elovici, Y.: Matching entities across online social networks. Neurocomputing 210, 91–106 (2016)

  29. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  30. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pp. 701–710 (2014)

  31. Piao, G., Breslin, J.G.: Inferring user interests for passive users on Twitter by leveraging followee biographies. In: Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017, pp. 122–133 (2017)

  32. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference, pp. 498–514. Springer, Berlin (2016)

  33. Ristoski, P., Paulheim, H.: Semantic Web in data mining and knowledge discovery: a comprehensive survey. Web Semant. Sci. Serv. Agents World Wide Web 36, 1–22 (2016). https://doi.org/10.1016/j.websem.2016.01.001

  34. Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., Paulheim, H.: RDF2Vec: RDF graph embeddings and their applications. Semant. Web (2019, to appear). http://www.semantic-web-journal.net/content/rdf2vec-rdf-graph-embeddings-and-their-applications-1

  35. Sadilek, A., Kautz, H., Bigham, J.P.: Finding your friends and following them to where you are. In: Proceedings of 5th ACM International Conference on Web Search and Data Mining (WSDM), pp. 723–732. ACM, New York (2012). https://doi.org/10.1145/2124295.2124380

  36. Shazeer, N., Doherty, R., Evans, C., Waterson, C.: Swivel: Improving embeddings by noticing what’s missing. CoRR (2016). arXiv:1602.02215

  37. Zafarani, R., Liu, H.: Connecting corresponding identities across communities. In: Proceedings of ICWSM. AAAI Press (2009)

  38. Zafarani, R., Liu, H.: Connecting users across social media sites: a behavioral-modeling approach. In: Proceedings of KDD, pp. 41–49. ACM (2013)

  39. Zheleva, E., Getoor, L.: To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In: Proceedings of the 18th International Conference on World Wide Web (WWW), pp. 531–540. ACM, New York, NY, USA (2009). https://doi.org/10.1145/1526709.1526781

Author information

Correspondence to Yaroslav Nechaev.

Appendix A: On the choice of word embeddings

Pre-trained word representations, or embeddings, have become a staple technique for modeling textual data in a convenient low-dimensional form. Such word representations are typically allocated in the resulting vector space according to the distributional semantics hypothesis: words that appear in similar contexts tend to have similar meanings and are therefore placed close to each other. Pre-trained word representations allow large unlabeled corpora to be used to model the target language efficiently. In the five years since the introduction of the word2vec algorithm, many new approaches have been proposed to improve different aspects of such representations, yielding better performance than the original word2vec on many tasks. Before word2vec, methods such as LSA, HAL, and autoencoders were also widely used. However, to the best of our knowledge at the time of writing, there is still no consensus in the community on whether any of the proposed approaches is clearly superior to the others and should be used by default to represent text. This conclusion is consistent with the well-known “no free lunch” principle: for each task, the choice of particular word embeddings should be justified and supported with experiments.

In this appendix, we measure the impact of different pre-trained word representations on our task. In our original paper, as mentioned in Sect. 3.1, we chose the latent semantic analysis (LSA) approach to represent text. The choice was driven mainly by our confidence in our LSA-based model, which had previously been used in DBpedia-related tasks. In this paper, in order to evaluate the new feature sets precisely, we opted to keep the same model. However, we believe that an additional set of experiments measuring the impact of different embeddings can serve as an extra data point for the community, and may also improve the performance of our approach.

A.1 Experimental setting

We compare the previously chosen LSA model to two recent embedding types: GloVe [29] and fastText [3]. To this end, we modify the computation of the “Description” scalars in our original base feature set (see Table 3). Each scalar is computed as follows. Each user text and entity text is converted into a sparse vector \(\mathbf {x}_{\text {sparse}} \in {\mathbb {R}}^v\), where v is the vocabulary size of the given language model. Each vector contains the tf-idf score of each token t present in the text:

$$\begin{aligned} x_t&= \text {tf}(t) \cdot \text {idf}(t,D) \\&= \log (1+\text {freq}_{t}) \cdot \log \left( 1+ \frac{ |D| }{ 1 + |\{d \in D : t \in d\}| }\right) \end{aligned}$$

where D is the corpus on which the chosen language model was trained. As can be seen, computing such a vector requires IDF statistics from that corpus. Here we use the precomputed vectors provided by the authors of the respective approaches, which do not ship such statistics along with the embeddings; therefore, we use the same IDF scores, acquired from Wikipedia, for all models. Each approach is represented by an embedding matrix \(M \in {\mathbb {R}}^{v \times d}\), where d is the embedding size. The dense text vector is then obtained as \(\mathbf {x}_{\text {dense}} = \mathbf {x}_{\text {sparse}}^T \cdot M\). Finally, the Description scalars are computed as the cosine similarity between the user text \(\mathbf {u}_{\text {dense}}\) and the entity text \(\mathbf {e}_{\text {dense}}\). In short, to test the other representation approaches, we substitute the embedding matrix \(M_{\text {lsa}}\) employed originally with \(M_{\texttt {fastText}}\) and \(M_{\text {GloVe}}\).
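The following sketch illustrates this computation. The names `vocab` (token-to-row index map), `idf` (Wikipedia-derived IDF scores), and `M` are illustrative placeholders rather than identifiers from our released code.

```python
import numpy as np

def description_scalar(user_tokens, entity_tokens, vocab, idf, M):
    """Cosine similarity of tf-idf-weighted text embeddings (sketch).

    vocab: token -> row index; idf: token -> IDF score (from Wikipedia);
    M: v x d embedding matrix of the model under test (LSA, GloVe, or
    fastText). All argument names are illustrative assumptions.
    """
    def dense(tokens):
        x_sparse = np.zeros(M.shape[0])            # x_sparse in R^v
        for t in set(tokens):
            if t in vocab:
                tf = np.log(1.0 + tokens.count(t)) # log-scaled frequency
                x_sparse[vocab[t]] = tf * idf.get(t, 0.0)
        return x_sparse @ M                        # x_dense = x_sparse^T M

    u, e = dense(user_tokens), dense(entity_tokens)
    norm = np.linalg.norm(u) * np.linalg.norm(e)
    return float(u @ e / norm) if norm > 0 else 0.0
```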

We test four different models and measure their impact on the performance of the base_kb_sg_tl approach described in the paper:

  1. LSA (\(v = 972,001; d = 100\)). The same LSA-based approach we used throughout the paper. The model is derived from Wikipedia and is described in [1].

  2. GloVe (\(v = 1,917,494; d = 300\)). The model is trained on the 42B-token Common Crawl corpus and provided by the authors on their website (note 17). This approach is also described in Sect. 4.1 and was used to produce the RDF embeddings we employ.

  3. fastText (\(v = 2,519,370; d = 300\)). A word2vec-based model exploiting subword information. The model was provided by the authors (note 18) and is trained on Wikipedia.

  4. ALL. The Description scalars produced by all of the above models, used together: effectively an ensemble of embeddings.

Additionally, we slightly modify our data acquisition phase. During this phase, we would typically gather the textual content for each candidate from the stream of tweets. As a consequence, we would have plenty of textual content for the more popular users and much less (as little as the description field of the profile) for the others. This causes the “Description” features to acquire a strong implicit notion of user popularity instead of capturing only textual similarity. To alleviate this bias, we freshly captured the 200 most recent tweets of each candidate of each entity in our gold standard using the Twitter REST API.
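A minimal sketch of this refresh step is shown below. We use tweepy (3.x-style calls) purely for illustration; the paper itself only specifies the Twitter REST API, whose statuses/user_timeline endpoint returns at most 200 tweets per request.

```python
import tweepy

# Illustrative credentials: replace the placeholders with real app keys.
auth = tweepy.AppAuthHandler("<consumer-key>", "<consumer-secret>")
api = tweepy.API(auth, wait_on_rate_limit=True)

def candidate_text(screen_name):
    """Concatenate the 200 most recent tweets of a candidate profile."""
    try:
        tweets = api.user_timeline(screen_name=screen_name,
                                   count=200, tweet_mode="extended")
        return " ".join(t.full_text for t in tweets)
    except tweepy.TweepError:
        return ""  # protected, suspended, or deleted account
```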

Fig. 8: P/R curves of the four models using the base_kb_sg_tl model

All approaches are evaluated using stratified fivefold cross-validation; we additionally compute 95% confidence intervals and perform statistical significance tests using the approximate randomization test (see Sect. 6.1).
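For reference, below is a minimal sketch of a paired approximate randomization test in the spirit of Noreen [27]. It assumes per-example predictions of the two compared systems are available; it is not the exact code used in our experiments.

```python
import random

def approximate_randomization(preds_a, preds_b, metric, trials=9999):
    """Paired approximate randomization test (sketch).

    In each trial the per-example predictions of the two systems are
    swapped at random; the p-value is the fraction of trials whose
    metric difference is at least as extreme as the observed one.
    """
    observed = abs(metric(preds_a) - metric(preds_b))
    extreme = 0
    for _ in range(trials):
        sa, sb = [], []
        for a, b in zip(preds_a, preds_b):
            if random.random() < 0.5:    # swap this pair
                a, b = b, a
            sa.append(a)
            sb.append(b)
        if abs(metric(sa) - metric(sb)) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)  # smoothed p-value
```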

A.2 Experimental results

Evaluation results for the four models are reported in Table 7, while Fig. 8 shows the precision/recall curves. As can be seen, the differences between the four models are minimal. However, the ALL model does provide a statistically significant improvement over the LSA model we employed originally, and similarly over fastText and GloVe.

The absence of significant differences between the individual models shows that our pipeline is not particularly sensitive to the choice of word embeddings. In the future, the model could be modified to incorporate textual representations in the same way we incorporated the graph-based ones, giving the neural network more opportunity to exploit textual data efficiently.

Cite this article

Nechaev, Y., Corcoglioniti, F. & Giuliano, C. SocialLink: exploiting graph embeddings to link DBpedia entities to Twitter profiles. Prog Artif Intell 7, 251–272 (2018). https://doi.org/10.1007/s13748-018-0160-x
