Abstract
Word clouds have proven to be an effective tool for information access in a range of domains. Because social media drives much of the growth in available user-generated content, means of accessing the information in that content are needed. We study word clouds as a means of information access in social media. Clouds currently generated from social media data include redundant and mis-ranked entries, which harms their utility. We propose a method for generating improved word clouds over social streams. In this method, named entities are detected, disambiguated and aggregated into clusters, which in turn inform cloud construction. We show that word clouds built from named entity clusters attain broader coverage and less content duplication. Further, an extrinsic evaluation shows improved access to data: word clouds with grouped named entities are rated as more relevant and more diverse. Additionally, we find that word clouds with higher Mean Average Precision (MAP) tend to be more relevant to the underlying concepts. Critically, this supports MAP as a tool for predicting cloud quality without human assessment.
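As a rough illustration of the two ideas in the abstract, the sketch below first merges mention counts that resolve to the same entity, so a cloud does not list several surface forms of one entity, and then computes Mean Average Precision using its standard information retrieval definition. This is a minimal sketch and not the authors' implementation: the `entity_of` lookup table, the entity identifiers and the function names are all hypothetical, and the abstract does not specify how the ranked lists and relevance judgements used for MAP are obtained.

```python
from collections import Counter


def grouped_cloud(mentions, entity_of, top_k=3):
    """Count mentions per entity and return the top_k (entry, count) pairs.

    entity_of is a hypothetical lookup from lowercased surface form to an
    entity identifier; it stands in for a real detection and linking step.
    """
    counts = Counter()
    for mention in mentions:
        surface = mention.lower()
        # Mentions with no known entity keep their own surface form.
        counts[entity_of.get(surface, surface)] += 1
    return counts.most_common(top_k)


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical data, for illustration only.
mentions = ["Obama", "Barack Obama", "obama", "NYC", "New York", "Paris"]
entity_of = {
    "obama": "Barack_Obama",
    "barack obama": "Barack_Obama",
    "nyc": "New_York_City",
    "new york": "New_York_City",
}
cloud = grouped_cloud(mentions, entity_of)
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

Without grouping, the three surface forms of the first entity would take up three separate cloud entries with a count of one each; after grouping they form a single entry with a count of three.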
Notes
1. See www.textrazor.com.
2. In these output excerpts, the columns are: bitstring; word; frequency in dataset.
3. Validation by users of the third metric introduced in [4], Coverage, is only possible with an interactive user evaluation. Hence, we do not include “coverage assessment” of word clouds in this study.
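The three-column layout described in note 2 (bitstring; word; frequency) could be read as sketched below. This is only an illustration under stated assumptions: the tab separator, the sample rows and the function names are hypothetical, since the excerpts themselves are not reproduced here. Grouping words by a shared bitstring prefix is one common way to coarsen hierarchical clusters of this kind.

```python
from collections import defaultdict


def parse_cluster_lines(lines, sep="\t"):
    """Parse 'bitstring<sep>word<sep>frequency' lines into tuples.

    The separator is an assumption; adjust it to the actual file format.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        bits, word, freq = line.split(sep)
        rows.append((bits, word, int(freq)))
    return rows


def group_by_prefix(rows, prefix_len):
    """Group (word, frequency) pairs by the first prefix_len bits."""
    groups = defaultdict(list)
    for bits, word, freq in rows:
        groups[bits[:prefix_len]].append((word, freq))
    return dict(groups)


# Hypothetical rows, for illustration only.
sample = ["0010\tobama\t120", "0011\tbarack\t80", "110\tparis\t40"]
rows = parse_cluster_lines(sample)
groups = group_by_prefix(rows, 3)
```

A shorter prefix merges more words into each group, and a longer prefix keeps the groups finer.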
References
Kuo, B.Y., Hentrich, T., Good, B.M., Wilkinson, M.D.: Tag clouds for summarizing web search results. In: Proceedings of the Conference on the World Wide Web (WWW), pp. 1203–1204. ACM (2007)
Miotto, R., Jiang, S., Weng, C.: eTACTS: a method for dynamically filtering clinical trial search results. J. Biomed. Inf. 46, 1060–1067 (2013)
Leginus, M., Dolog, P., Lage, R.: Graph based techniques for tag cloud generation. In: Proceedings of the ACM Conference on Hypertext and Social Media, pp. 148–157, ACM (2013)
Venetis, P., Koutrika, G., Garcia-Molina, H.: On the selection of tags for tag clouds. In: Proceedings of the Conference on Web Search and Data Mining (WSDM), pp. 835–844. ACM (2011)
Leginus, M., Zhai, C., Dolog, P.: Personalized generation of word clouds from tweets. J. Assoc. Inf. Sci. Technol. (2015)
Tufekci, Z.: Big questions for social media big data: representativeness, validity and other methodological pitfalls. In: Proceedings of the International Conference on Weblogs and Social Media (ICWSM), AAAI, pp. 505–514 (2014)
Bernstein, M.S., Suh, B., Hong, L., Chen, J., Kairam, S., Chi, E.H.: Eddi: interactive topic-based browsing of social status streams. In: Proceedings of the Annual Symposium on User Interface Software and Technology (UIST), pp. 303–312. ACM (2010)
Lage, R., Dolog, P., Leginus, M.: The role of adaptive elements in web-based surveillance system user interfaces. In: Dimitrova, V., Kuflik, T., Chin, D., Ricci, F., Dolog, P., Houben, G.-J. (eds.) UMAP 2014. LNCS, vol. 8538, pp. 350–362. Springer, Heidelberg (2014)
Rout, D., Bontcheva, K., Hepple, M.: Reliably evaluating summaries of Twitter timelines. In: Proceedings of the AAAI Workshop on Analyzing Microtext, AAAI, pp. 64–71 (2013)
Derczynski, L., Maynard, D., Aswani, N., Bontcheva, K.: Microblog-genre noise and impact on semantic annotation accuracy. In: Proceedings of the ACM Conference on Hypertext and Social Media, pp. 21–30. ACM (2013)
Maynard, D., Greenwood, M.A.: Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. In: Proceedings of the Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, ELRA (2014)
Leginus, M., Derczynski, L., Dolog, P.: Enhanced information access to social streams through word clouds with entity grouping. In: Proceedings of the conference on Web Information Systems and Technologies (WEBIST) (2015)
Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in twitter data with crowdsourcing. In: Proceedings of the Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, ACL, pp. 80–88 (2010)
Hogan, A., Zimmermann, A., Umbrich, J., Polleres, A., Decker, S.: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora. Web Semant. Sci. Serv. Agents World Wide Web 10, 76–110 (2012)
Hu, Y., Talamadupula, K., Kambhampati, S., et al.: Dude, srsly?: The surprisingly formal nature of Twitter’s language. In: Proceedings of the International Conference on Weblogs and Social Media (ICWSM), AAAI (2013)
Baldwin, T., Cook, P., Lui, M., MacKinlay, A., Wang, L.: How noisy social media text, how diffrnt social media sources. In: Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), pp. 356–364 (2013)
Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M.A., Maynard, D., Aswani, N.: TwitIE: an open-source information extraction pipeline for microblog text. In: Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP), pp. 83–90 (2013)
Baldwin, T., Kim, Y.B., de Marneffe, M.C., Ritter, A., Han, B., Xu, W.: Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. ACL-IJCNLP 2015, 126 (2015)
Han, B., Baldwin, T.: Lexical normalisation of short text messages: makn sens a #twitter. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), ACL, pp. 368–378 (2011)
Augenstein, I., Gentile, A.L., Norton, B., Zhang, Z., Ciravegna, F.: Mapping keywords to linked data resources for automatic query expansion. In: Proceedings of the Second International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, pp. 9–20 (2013)
Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Petrak, J., Bontcheva, K.: Analysis of named entity recognition and linking for tweets. Inf. Process. Manag. 51, 32–49 (2015)
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the Meeting of the Special Interest Group on Management of Data (SIGMOD), pp. 1247–1250. ACM (2008)
Kergl, D., Roedler, R., Seeber, S.: On the endogenesis of Twitter’s Spritzer and Gardenhose sample streams. In: Proceedings of the Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 357–364. IEEE (2014)
Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18, 467–479 (1992)
Wittgenstein, L.: Philosophical Investigations. Basil Blackwell, London (1953)
Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the Annual International Conference on Systems Documentation (SIGDOC), pp. 24–26. ACM (1986)
Derczynski, L., Chester, S., Bøgh, K.S.: Tune your Brown clustering, please. In: Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP) (2015)
Lui, M., Baldwin, T.: langid.py: an off-the-shelf language identification tool. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), vol. 3, pp. 25–30. ACL (2012)
Wu, W., Zhang, B., Ostendorf, M.: Automatic generation of personalized annotation tags for Twitter users. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), ACL, pp. 689–692 (2010)
Ounis, I., Macdonald, C., Lin, J., Soboroff, I.: Overview of the TREC-2011 microblog track. In: Proceedings of the Text REtrieval Conference (TREC) (2011)
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., McCullough, D.: On building a reusable Twitter corpus. In: Proceedings of the meeting of the Special Interest Group in Information Retrieval (SIGIR), pp. 1113–1114. ACM (2012)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
Mei, Q., Guo, J., Radev, D.: Divrank: the interplay of prestige and diversity in information networks. In: Proceedings of the meeting of the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pp. 1009–1018. ACM (2010)
Sabou, M., Bontcheva, K., Derczynski, L., Scharl, A.: Corpus annotation through crowdsourcing: towards best practice guidelines. In: Proceedings of the conference on Language Resources and Evaluation (LREC), ELRA (2014)
Acknowledgments
This work was partially supported by the European Union under grant agreement No. 611233 Pheme.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Leginus, M., Derczynski, L., Dolog, P. (2016). Entity Grouping for Accessing Social Streams via Word Clouds. In: Monfort, V., Krempels, KH., Majchrzak, T.A., Turk, Ž. (eds) Web Information Systems and Technologies. WEBIST 2015. Lecture Notes in Business Information Processing, vol 246. Springer, Cham. https://doi.org/10.1007/978-3-319-30996-5_1
DOI: https://doi.org/10.1007/978-3-319-30996-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30995-8
Online ISBN: 978-3-319-30996-5
eBook Packages: Computer Science, Computer Science (R0)