SocialLink: exploiting graph embeddings to link DBpedia entities to Twitter profiles

Regular Paper · Progress in Artificial Intelligence

Abstract

SocialLink is a project designed to match social media profiles on Twitter to the corresponding entities in DBpedia. Built to bridge the vibrant Twitter social media world and the Linked Open Data cloud, SocialLink enables knowledge transfer between the two, both assisting Semantic Web practitioners in harvesting the vast amounts of information available on Twitter and allowing DBpedia data to be leveraged for social media analysis tasks. In this paper, we further extend the original SocialLink approach by exploiting graph-based features drawn from both DBpedia and Twitter, represented as graph embeddings learned from vast amounts of unlabeled data. Introducing these new features required us to redesign our deep neural network-based candidate selection algorithm; as a result, we experimentally demonstrate a significant improvement in the performance of SocialLink.


Notes

  1. http://sociallink.futuro.media

  2. https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego

  3. https://github.com/Remper/sociallink

  4. https://zenodo.org/record/820160

  5. We start from KB entries because they are entirely known in advance, unlike social network profiles, which can only be queried or (partially) acquired via expensive crawling.

  6. English DBpedia version 2016-04, as concerns the experiments reported in this paper (to enable comparison with the original approach in [25]). The SocialLink LOD dataset released online is instead built using data from all language chapters of the most recent DBpedia.

  7. Entity alive status is gathered from temporal properties like dbo:deathDate, dbo:deathYear, dbo:closingYear, dbo:closed, dbo:extinctionYear, dbo:extinctionDate, wikidata:P570, wikidata:P20, wikidata:P509, or properties implying death like dbo:deathPlace, dbo:deathCause, dbo:causeOfDeath.

  8. Gold alignments derive from selected foaf:isPrimaryTopicOf and wikidata:P2002 triples of entities assumed living.

  9. http://flink.apache.org/

  10. See [1, Section 3.2] for a detailed description of how LSA embeddings are computed.

  11. PageRank Split embeddings downloaded from http://data.dws.informatik.uni-mannheim.de/rdf2vec/models/DBpedia/2016-04/GlobalVectors/

  12. The regression models described in this section perform an approximate matrix factorization rather than the exact one used, for example, by LSA.

  13. We compare the performance on the subset with that on its complement using the non-paired approximate randomization test.

  14. http://datahub.io/dataset/sociallink

  15. http://github.com/Remper/sociallink

  16. http://sociallink.futuro.media/sparql

  17. http://nlp.stanford.edu/data/glove.42B.300d.zip

  18. https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

References

  1. Aprosio, A.P., Giuliano, C., Lavelli, A.: Automatic expansion of DBpedia exploiting Wikipedia cross-language information. In: Proceedings of the Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26–30, 2013. Lecture Notes in Computer Science, vol. 7882, pp. 397–411. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-38288-8_27

  2. Besel, C., Schlötterer, J., Granitzer, M.: Inferring semantic interest profiles from Twitter followees: Does Twitter know better than your friends? In: ACM SAC, pp. 1152–1157 (2016)

  3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  4. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Biased graph walks for RDF graph embeddings. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, pp. 21:1–21:12 (2017)

  5. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global RDF vector space embeddings. In: The Semantic Web-16th International Semantic Web Conference ISWC 2017, Vienna, Austria, October 21-25, 2017, Proceedings, Part I, Lecture Notes in Computer Science, vol. 10587, pp. 190–207. Springer (2017). https://doi.org/10.1007/978-3-319-68288-4_12

  6. Corcoglioniti, F., Giuliano, C., Nechaev, Y., Zanoli, R.: Pokedem: An automatic social media management application. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys ’17, pp. 358–359. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3109859.3109980

  7. Corcoglioniti, F., Palmero Aprosio, A., Nechaev, Y., Giuliano, C.: MicroNeel: Combining NLP tools to perform named entity detection and linking on microposts. In: EVALITA (2016)

  8. Corcoglioniti, F., Rospocher, M., Mostarda, M., Amadori, M.: Processing billions of RDF triples on a single machine using streaming and sorting. In: ACM SAC, pp. 368–375 (2015)

  9. Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent semantic kernels. J. Intell. Inf. Syst. 18(2), 127–152 (2002). https://doi.org/10.1023/A:1013625426931

  10. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing Wikidata to the linked data web. In: Proceedings of the 13th International Semantic Web Conference-Part I, ISWC ’14, pp. 50–65. Springer, New York, NY, USA (2014). https://doi.org/10.1007/978-3-319-11964-9_4

  11. Faralli, S., Stilo, G., Velardi, P.: Large scale homophily analysis in Twitter using a Twixonomy. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, pp. 2334–2340 (2015)

  12. Fetahu, B., Anand, A., Anand, A.: How much is Wikipedia lagging behind news? In: Proceedings of the ACM Web Science Conference, WebSci ’15, pp. 28:1–28:9. ACM, New York, NY, USA (2015). https://doi.org/10.1145/2786451.2786460

  13. Goga, O.: Matching user accounts across online social networks: methods and applications. Ph.D. thesis, LIP6-Laboratoire d’Informatique de Paris 6 (2014)

  14. Goga, O., Lei, H., Parthasarathi, S.H.K., Friedland, G., Sommer, R., Teixeira, R.: Exploiting innocuous activity for correlating users across sites. In: Proceedings of the WWW, pp. 447–458. ACM (2013)

  15. Goga, O., Loiseau, P., Sommer, R., Teixeira, R., Gummadi, K.P.: On the reliability of profile matching across large online social networks. In: Proceedings of KDD, pp. 1799–1808. ACM (2015)

  16. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: a survey (2017). arXiv:1705.02801

  17. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: The 22th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 855–864. ACM (2016)

  18. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194, 28–61 (2013). https://doi.org/10.1016/j.artint.2012.06.001

  19. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)

  20. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015). https://doi.org/10.3233/SW-140134

  21. Liu, S., Wang, S., Zhu, F., Zhang, J., Krishnan, R.: HYDRA: Large-scale social identity linkage via heterogeneous behavior modeling. In: Proceedings of SIGMOD, pp. 51–62. ACM (2014)

  22. Lu, C.T., Shuai, H.H., Yu, P.S.: Identifying your customers in social networks. In: Proceedings of CIKM, pp. 391–400. ACM (2014)

  23. Minard, A., Qwaider, M.R.H., Magnini, B.: FBK-NLP at NEEL-IT: active learning for domain adaptation. In: EVALITA (2016)

  24. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: Concealing interests of passive users in social media. In: Proceedings of the Re-coding Black Mirror 2017 Workshop co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, 22 Oct 2017 (2017)

  25. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: Linking knowledge bases to social media profiles. In: ACM SAC, pp. 145–150 (2017)

  26. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: SocialLink: Linking DBpedia entities to corresponding Twitter accounts. In: The Semantic Web-ISWC 2017, pp. 165–174. Springer, Berlin (2017). https://doi.org/10.1007/978-3-319-68204-4_17

  27. Noreen, E.W.: Computer-Intensive Methods for Testing Hypotheses. Wiley, New York (1989)

  28. Peled, O., Fire, M., Rokach, L., Elovici, Y.: Matching entities across online social networks. Neurocomputing 210, 91–106 (2016)

  29. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  30. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pp. 701–710 (2014)

  31. Piao, G., Breslin, J.G.: Inferring user interests for passive users on Twitter by leveraging followee biographies. In: Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017, pp. 122–133 (2017)

  32. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference, pp. 498–514. Springer, Berlin (2016)

  33. Ristoski, P., Paulheim, H.: Semantic Web in data mining and knowledge discovery: a comprehensive survey. Web Semant. Sci. Serv. Agents World Wide Web 36, 1–22 (2016). https://doi.org/10.1016/j.websem.2016.01.001

  34. Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., Paulheim, H.: RDF2Vec: RDF graph embeddings and their applications. Semant. Web (2019, to appear). http://www.semantic-web-journal.net/content/rdf2vec-rdf-graph-embeddings-and-their-applications-1

  35. Sadilek, A., Kautz, H., Bigham, J.P.: Finding your friends and following them to where you are. In: Proceedings of 5th ACM International Conference on Web Search and Data Mining (WSDM), pp. 723–732. ACM, New York (2012). https://doi.org/10.1145/2124295.2124380

  36. Shazeer, N., Doherty, R., Evans, C., Waterson, C.: Swivel: Improving embeddings by noticing what’s missing. CoRR (2016). arXiv:1602.02215

  37. Zafarani, R., Liu, H.: Connecting corresponding identities across communities. In: Proceedings of ICWSM. AAAI Press (2009)

  38. Zafarani, R., Liu, H.: Connecting users across social media sites: a behavioral-modeling approach. In: Proceedings of KDD, pp. 41–49. ACM (2013)

  39. Zheleva, E., Getoor, L.: To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In: Proceedings of the 18th International Conference on World Wide Web (WWW), pp. 531–540. ACM, New York, NY, USA (2009). https://doi.org/10.1145/1526709.1526781

Author information

Correspondence to Yaroslav Nechaev.

Appendix A: On the choice of word embeddings

Pre-trained word representations, or embeddings, have become a staple technique for modeling textual data in a convenient low-dimensional form. Such word representations are typically allocated in the resulting vector space according to the distributional semantics hypothesis: words that appear in similar contexts tend to have similar meanings and are therefore placed close to each other. Pre-trained word representations allow large unlabeled corpora to be used to model the target language efficiently. In the five years since the introduction of the word2vec algorithm, many new approaches have been proposed to improve different aspects of such representations, yielding better performance than the original word2vec on many tasks. Before word2vec, methods such as LSA, HAL, and autoencoders were also widely used. However, to the best of our knowledge at the time of writing, there is still no consensus in the community on whether any of the proposed approaches is clearly superior to the others and should be used by default to represent text. This conclusion is consistent with the well-known “no free lunch” principle: for each task, the choice of particular word embeddings should be justified and supported with experiments.

In this appendix, we measure the impact of different pre-trained word representations on our task. In our original paper, as mentioned in Sect. 3.1, we chose the latent semantic analysis (LSA) approach to represent text. The choice was driven mainly by our confidence in our LSA-based model, which had previously been used in DBpedia-related tasks. In this paper, in order to evaluate the new feature sets precisely, we opted to keep the same model. However, we believe that an additional set of experiments measuring the impact of different embeddings can serve as an extra data point for the community, and may also improve the performance of our approach.

A.1 Experimental setting

We compare the previously chosen LSA model to two recent embedding types: GloVe [29] and fastText [3]. To this end, we modify the computation of the “Description” scalars in our original base feature set (see Table 3). Each scalar is computed as follows. Each user text and entity text is converted into a sparse vector \(\mathbf {x}_{\text {sparse}} \in {\mathbb {R}}^v\), where v is the vocabulary size of the given language model. Each vector contains the tf-idf score of each token t present in the text:

$$\begin{aligned} x_t&= \text {tf}(t) \cdot \text {idf}(t,D) \\&= \log (1+\text {freq}_{t}) \cdot \log \left( 1+ \frac{ |D| }{ 1 + |\{d \in D : t \in d\}| }\right) \end{aligned}$$

where D is the corpus on which the chosen language model was trained. As can be seen, computing such a vector requires IDF statistics from that corpus. Here we use the precomputed vectors provided by the authors of the respective approaches, which do not ship such statistics along with the embeddings; therefore, we use the same IDF scores, acquired from Wikipedia, for all models. Each approach is represented by an embedding matrix \(M \in {\mathbb {R}}^{v \times d}\), where d is the embedding size. The dense text vector is then obtained as \(\mathbf {x}_{\text {dense}} = \mathbf {x}_{\text {sparse}}^T \cdot M\). Finally, the Description scalars are computed as the cosine similarity between the user text \(\mathbf {u}_{\text {dense}}\) and the entity text \(\mathbf {e}_{\text {dense}}\). In short, to test the other representation approaches, we substitute the embedding matrix \(M_{\text {lsa}}\) employed originally with \(M_{\texttt {fastText}}\) and \(M_{\text {GloVe}}\).
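The following sketch illustrates this computation. The names `vocab` (token-to-row index map), `idf` (Wikipedia-derived IDF scores), and `M` are illustrative placeholders rather than identifiers from our released code.

```python
import numpy as np

def description_scalar(user_tokens, entity_tokens, vocab, idf, M):
    """Cosine similarity of tf-idf-weighted text embeddings (sketch).

    vocab: token -> row index; idf: token -> IDF score (from Wikipedia);
    M: v x d embedding matrix of the model under test (LSA, GloVe, or
    fastText). All argument names are illustrative assumptions.
    """
    def dense(tokens):
        x_sparse = np.zeros(M.shape[0])            # x_sparse in R^v
        for t in set(tokens):
            if t in vocab:
                tf = np.log(1.0 + tokens.count(t)) # log-scaled frequency
                x_sparse[vocab[t]] = tf * idf.get(t, 0.0)
        return x_sparse @ M                        # x_dense = x_sparse^T M

    u, e = dense(user_tokens), dense(entity_tokens)
    norm = np.linalg.norm(u) * np.linalg.norm(e)
    return float(u @ e / norm) if norm > 0 else 0.0
```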

We test four different models and measure their impact on the performance of the base_kb_sg_tl approach described in the paper:

  1. LSA (\(v = 972,001; d = 100\)). The same LSA-based approach we used throughout the paper. The model is derived from Wikipedia and is described in [1].

  2. GloVe (\(v = 1,917,494; d = 300\)). The model is trained on the 42B-token Common Crawl corpus and provided by the authors on their website (note 17). This approach is also described in Sect. 4.1 and was used to produce the RDF embeddings we employ.

  3. fastText (\(v = 2,519,370; d = 300\)). A word2vec-based model exploiting subword information. The model was provided by the authors (note 18) and is trained on Wikipedia.

  4. ALL. The Description scalars produced by all of the above models, used together: effectively an ensemble of embeddings.

Additionally, we slightly modify our data acquisition phase. During this phase, we would typically gather the textual content for each candidate from the stream of tweets. As a consequence, we would have plenty of textual content for the more popular users and much less (as little as the description field of the profile) for the others. This causes the “Description” features to acquire a strong implicit notion of user popularity instead of capturing only textual similarity. To alleviate this bias, we freshly captured the 200 most recent tweets of each candidate of each entity in our gold standard using the Twitter REST API.
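A minimal sketch of this refresh step is shown below. We use tweepy (3.x-style calls) purely for illustration; the paper itself only specifies the Twitter REST API, whose statuses/user_timeline endpoint returns at most 200 tweets per request.

```python
import tweepy

# Illustrative credentials: replace the placeholders with real app keys.
auth = tweepy.AppAuthHandler("<consumer-key>", "<consumer-secret>")
api = tweepy.API(auth, wait_on_rate_limit=True)

def candidate_text(screen_name):
    """Concatenate the 200 most recent tweets of a candidate profile."""
    try:
        tweets = api.user_timeline(screen_name=screen_name,
                                   count=200, tweet_mode="extended")
        return " ".join(t.full_text for t in tweets)
    except tweepy.TweepError:
        return ""  # protected, suspended, or deleted account
```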

Fig. 8: P/R curves of the four models using the base_kb_sg_tl model

All approaches are evaluated using stratified fivefold cross-validation; we additionally compute 95% confidence intervals and perform statistical significance tests using the approximate randomization test (see Sect. 6.1).
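For reference, below is a minimal sketch of a paired approximate randomization test in the spirit of Noreen [27]. It assumes per-example predictions of the two compared systems are available; it is not the exact code used in our experiments.

```python
import random

def approximate_randomization(preds_a, preds_b, metric, trials=9999):
    """Paired approximate randomization test (sketch).

    In each trial the per-example predictions of the two systems are
    swapped at random; the p-value is the fraction of trials whose
    metric difference is at least as extreme as the observed one.
    """
    observed = abs(metric(preds_a) - metric(preds_b))
    extreme = 0
    for _ in range(trials):
        sa, sb = [], []
        for a, b in zip(preds_a, preds_b):
            if random.random() < 0.5:    # swap this pair
                a, b = b, a
            sa.append(a)
            sb.append(b)
        if abs(metric(sa) - metric(sb)) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)  # smoothed p-value
```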

A.2 Experimental results

Evaluation results for the four models are reported in Table 7, while Fig. 8 shows the precision/recall curves. As can be seen, the differences between the four models are minimal. However, the ALL model does provide a statistically significant improvement over the LSA model we employed originally, and similarly over fastText and GloVe.

The absence of significant differences between the individual models shows that our pipeline is not particularly sensitive to the choice of word embeddings. In the future, the model could be modified to incorporate textual representations in the same way we incorporated the graph-based ones, giving the neural network more opportunity to exploit textual data efficiently.

Cite this article

Nechaev, Y., Corcoglioniti, F. & Giuliano, C. SocialLink: exploiting graph embeddings to link DBpedia entities to Twitter profiles. Prog Artif Intell 7, 251–272 (2018). https://doi.org/10.1007/s13748-018-0160-x
