Vector-Based Similarity Measurements for Historical Figures

Chen, Yanqing; Perozzi, Bryan; Skiena, Steven

doi:10.1007/978-3-319-25087-8_17

Yanqing Chen,
Bryan Perozzi &
Steven Skiena

Conference paper
First Online: 17 October 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9371)

Abstract

Historical interpretation benefits from identifying analogies among famous people: Who are the Lincolns, Einsteins, Hitlers, and Mozarts? We investigate several approaches to converting approximately 600,000 historical figures into vector representations in order to quantify their similarity based on their Wikipedia pages. We adopt an effective reference standard based on the number of human-annotated Wikipedia categories that figures share, and use it to demonstrate the performance of our similarity detection algorithms. In particular, we investigate four different unsupervised approaches to representing the semantic associations of individuals: (1) TF-IDF, (2) a weighted average of distributed word embeddings, (3) LDA topic analysis, and (4) Deepwalk embedding from page links. All proved effective, but Deepwalk embedding yielded an overall accuracy of 91.33% in our evaluation at uncovering historical analogies. Combining LDA and Deepwalk yielded even higher performance.
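
To make the first approach and the reference standard concrete, here is a minimal Python sketch, not the authors' implementation. It builds TF-IDF vectors from short page texts, compares them by cosine similarity, and counts shared categories as a stand-in for the reference standard. The page snippets, category labels, tokenization, TF-IDF weighting variant, and all function names are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each name to a sparse TF-IDF vector (dict of term -> weight)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of pages that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        # Assumed weighting: relative term frequency times log(N / df).
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def shared_categories(categories, a, b):
    """Number of category labels that two figures have in common."""
    return len(categories[a] & categories[b])

# Hypothetical toy snippets standing in for Wikipedia page text.
docs = {
    "Mozart": "composer of classical music symphony opera",
    "Beethoven": "composer of classical music symphony piano",
    "Einstein": "physicist of relativity theory physics",
}
# Hypothetical category labels standing in for Wikipedia categories.
categories = {
    "Mozart": {"Composers", "Austrian people"},
    "Beethoven": {"Composers", "German people"},
    "Einstein": {"Physicists", "German people"},
}

vecs = tfidf_vectors(docs)
sim_mb = cosine(vecs["Mozart"], vecs["Beethoven"])
sim_me = cosine(vecs["Mozart"], vecs["Einstein"])
print("Mozart vs Beethoven:", sim_mb, shared_categories(categories, "Mozart", "Beethoven"))
print("Mozart vs Einstein:", sim_me, shared_categories(categories, "Mozart", "Einstein"))
```

In this toy setup the vector similarity and the category overlap agree: the two composers score higher on both measures than the composer and the physicist. The other three representations would replace only the `tfidf_vectors` step; the cosine comparison and the category-based check stay the same.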

Author information

Authors and Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
Yanqing Chen, Bryan Perozzi & Steven Skiena

Corresponding author

Correspondence to Steven Skiena.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Giuseppe Amato
University of Strathclyde, Glasgow, United Kingdom
Richard Connor
ISTI-CNR, Pisa, Italy
Fabrizio Falchi
ISTI-CNR, Pisa, Italy
Claudio Gennaro

About this paper

Cite this paper

Chen, Y., Perozzi, B., Skiena, S. (2015). Vector-Based Similarity Measurements for Historical Figures. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds) Similarity Search and Applications. SISAP 2015. Lecture Notes in Computer Science, vol 9371. Springer, Cham. https://doi.org/10.1007/978-3-319-25087-8_17

DOI: https://doi.org/10.1007/978-3-319-25087-8_17
Published: 17 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25086-1
Online ISBN: 978-3-319-25087-8
eBook Packages: Computer Science, Computer Science (R0)
