Abstract
Microblogging (tweeting) has become a mainstream channel for sharing information on the Internet, and tweets link users into a huge social network. Recognizing communities in tweet-based social networks is important for identifying users’ interests, which can help companies improve their marketing strategies. However, tweet messages are massive in volume, span many domains, and are short and unstructured, so existing approaches are difficult to apply to them directly. For this reason, we present DICH, a framework to Discover Implicit Communities Hidden in tweet data. To implement the framework, we propose techniques for preprocessing tweet data and develop an unsupervised learning method called MbCLARANS, an optimized CLARANS algorithm, to discover the implicit communities hidden in tweet datasets. During clustering, the pairwise relationships between users are used to improve clustering quality. In addition, an adaptive k strategy makes the approach more widely applicable. Experiments on tweet data collected from SINA Weibo demonstrate the performance of the approach.
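For orientation, the following is a minimal sketch of the standard CLARANS randomized medoid search that MbCLARANS is described as optimizing. It is not the authors' method: the pairwise user relationships and the adaptive k strategy mentioned in the abstract are not modeled here, and the function names, the `dist` callback, and the parameters `num_local` and `max_neighbor` are illustrative assumptions.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, dist, num_local=2, max_neighbor=200, seed=0):
    """Baseline CLARANS sketch (assumes 0 < k < len(points)).

    Returns (medoid_indices, cost). This is the generic algorithm only;
    it does not include the MbCLARANS optimizations.
    """
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")

    for _ in range(num_local):
        # Start each local search from a random set of k medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current, dist)
        tried = 0

        while tried < max_neighbor:
            # A neighbor differs from the current set in exactly one medoid.
            i = rng.randrange(k)
            candidates = [j for j in range(n) if j not in current]
            replacement = rng.choice(candidates)
            neighbor = current[:i] + [replacement] + current[i + 1:]
            cost = total_cost(points, neighbor, dist)

            if cost < current_cost:
                # Move to the better neighbor and restart the counter.
                current, current_cost = neighbor, cost
                tried = 0
            else:
                tried += 1

        # Keep the best local minimum found across all restarts.
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost

    return best_medoids, best_cost
```

Because the search only needs a pairwise distance function, it can in principle run on any user-to-user dissimilarity, which is one reason a medoid-based method suits data that does not sit naturally in a vector space. How the paper defines that dissimilarity for tweets is not stated in the abstract.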
Cite this article
Peng, D., Lei, X. & Huang, T. DICH: A framework for discovering implicit communities hidden in tweets. World Wide Web 18, 795–818 (2015). https://doi.org/10.1007/s11280-014-0279-z
DOI: https://doi.org/10.1007/s11280-014-0279-z