Abstract
In this study, lexical analysis is applied to the log data of conversations on Internet forums. Many statistical regularities are known to hold in documents, for example Zipf’s law and Heaps’ law, and this type of analysis has been applied to documents in various media. However, few studies apply it to documents developed by many authors, such as the conversation logs of Internet forums. The relationship between document size and these regularities is usually not considered important, because the size of a document is determined by its author, who is normally a single person. In contrast, the size of an Internet forum’s communication log is an emergent property of the people who are interested in the forum. We believe that it is important to understand the dynamics of such conversations.
The investigation in this study revealed the following trend: the number of posted messages is small if the vocabulary growth parameter β of Heaps’ law is not within a preferred range. Additionally, this study proposes a new explanation, based on the multiple-author environment, for the differences in this parameter β. Traditionally, documents written by more than one person, for example web sites and programming language source code, are analyzed from a single-author point of view. This traditional approach is important but not sufficient, because it cannot account for differences in vocabulary among the individual authors.
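As an illustration of the parameter β discussed above, Heaps’ law is commonly written as V(n) ≈ K·n^β, where V(n) is the number of distinct words seen after n tokens. The following is a minimal sketch of one way β might be estimated from a token stream, assuming an ordinary least-squares fit in log-log space; the abstract does not describe the fitting procedure actually used in the paper. The function names (`vocabulary_growth`, `fit_heaps`, `zipf_tokens`) and the synthetic Zipf-like data are hypothetical and serve only as a stand-in for real forum logs.

```python
import math
import random


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: number of distinct tokens after the first n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Assumption: a plain log-log regression; other estimators are possible.
    Requires at least two points on the curve.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


def zipf_tokens(length, vocab_size, seed=0):
    """Hypothetical test data: tokens drawn with probability proportional to 1/rank."""
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=length)


if __name__ == "__main__":
    tokens = zipf_tokens(length=20000, vocab_size=5000, seed=0)
    K, beta = fit_heaps(vocabulary_growth(tokens))
    print(f"K = {K:.3f}, beta = {beta:.3f}")
```

For a real forum, `tokens` would be the concatenated, tokenized messages of one thread or board in posting order, so that the fitted β can be compared across forums with different numbers of posted messages.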
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kubo, M., Sato, H., Matsubara, T. (2011). Word Familiarity Distributions to Understand Heaps’ Law of Vocabulary Growth of the Internet Forums. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2011. Lecture Notes in Computer Science, vol 6883. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23854-3_66
DOI: https://doi.org/10.1007/978-3-642-23854-3_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23853-6
Online ISBN: 978-3-642-23854-3
eBook Packages: Computer Science (R0)