Skip to main content

Vector-Based Similarity Measurements for Historical Figures

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9371))

Abstract

Historical interpretation benefits from identifying analogies among famous people: Who are the Lincolns, Einsteins, Hitlers, and Mozarts? We investigate several approaches to convert approximately 600,000 historical figures into vector representations to quantify similarity according to their Wikipedia pages. We adopt an effective reference standard based on the number of human-annotated Wikipedia categories being shared and use this to demonstrate the performance of our similarity detection algorithms. In particular, we investigate four different unsupervised approaches to representing the semantic associations of individuals: (1) TF-IDF, (2) Weighted average of distributed word embedding, (3) LDA Topic analysis and (4) Deepwalk embedding from page links. All proved effective, but Deepwalk embedding yielded an overall accuracy of 91.33% in our evaluation to uncover historical analogies. Combining LDA and Deepwalk yielded even higher performance.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Al-Rfou, R., Perozzi, B., Skiena, S.: Polyglot: Distributed word representations for multilingual NLP. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 183–192 (2013)

    Google Scholar 

  2. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155 (2003)

    MATH  Google Scholar 

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)

    MATH  Google Scholar 

  4. Chen, Y., Perozzi, B., Al-Rfou, R., Skiena, S.: The expressive power of word embeddings. In: ICML 2013 Workshop on Deep Learning for Audio, Speech, and Language Processing (2013)

    Google Scholar 

  5. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12, 2493–2537 (2011)

    MATH  Google Scholar 

  6. Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (2008)

    Google Scholar 

  7. Fellbaum, C.: WordNet. In: Theory and Applications of Ontology: Computer Applications, pp. 231–243 (2010)

    Google Scholar 

  8. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (2008)

    Google Scholar 

  9. Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (2012)

    Google Scholar 

  10. Kim, M., Zhang, B.T., Lee, J.S.: Subjective document classification using network analysis. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 365–369. IEEE (2010)

    Google Scholar 

  11. Krestel, R., Fankhauser, P., Nejdl, W.: Latent dirichlet allocation for tag recommendation. In: Proceedings of the third ACM conference on Recommender Systems, pp. 61–68. ACM (2009)

    Google Scholar 

  12. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML-2014), pp. 1188–1196 (2014)

    Google Scholar 

  13. Maiya, A.S., Rolfe, R.M.: Topic similarity networks: visual analytics for large document sets. In: 2014 IEEE International Conference on Big Data (Big Data), pp. 364–372. IEEE (2014)

    Google Scholar 

  14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv preprint arXiv:1301.3781

  15. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)

    Google Scholar 

  16. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010)

    Google Scholar 

  17. Skiena, S., Ward, C.B.: Who’s Bigger?: Where Historical Figures Really Rank. Cambridge University Press (2013)

    Google Scholar 

  18. Wang, C., Yu, X., Li, Y., Zhai, C., Han, J.: Content coverage maximization on word networks for hierarchical topic summarization. In: Proceedings of the 22nd ACM international conference on Conference on Information & Knowledge Management, pp. 249–258. ACM (2013)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Steven Skiena .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chen, Y., Perozzi, B., Skiena, S. (2015). Vector-Based Similarity Measurements for Historical Figures. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds) Similarity Search and Applications. SISAP 2015. Lecture Notes in Computer Science(), vol 9371. Springer, Cham. https://doi.org/10.1007/978-3-319-25087-8_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-25087-8_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25086-1

  • Online ISBN: 978-3-319-25087-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics