
Word Familiarity Distributions to Understand Heaps’ Law of Vocabulary Growth of the Internet Forums

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6883)

Abstract

In this study, lexical analysis is applied to the log data of conversations on Internet forums. Many statistical regularities are known to hold in documents, for example, Zipf’s law and Heaps’ law, and this type of analysis has been applied to documents in various media. However, few studies have applied it to documents developed by many authors, such as the log data of conversations on Internet forums. The relationship between document size and these regularities is usually not considered important, because the size of a document is determined by its author, who is normally a single person. In contrast, the size of the communication log of an Internet forum is an emergent property of the people who are interested in the forum. We believe that this relationship is important for understanding the dynamics of conversations.

The investigation reveals the following trend: the number of posted messages is small if the vocabulary growth parameter β of Heaps’ law is not within a preferred range. Additionally, this study proposes a new explanation, based on the multiple-author environment, for the differences in this parameter β. Traditionally, documents written by more than one person, for example, web sites and programming languages, are analyzed from a single-author point of view. This traditional approach is very important but not sufficient, because it cannot account for differences in vocabulary among the individual authors.
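For readers unfamiliar with the parameter β, the following is a minimal sketch of how it could be estimated from a token stream. It assumes the usual form of Heaps’ law, V(n) = K·n^β, where V(n) is the number of distinct words among the first n tokens, and fits β by ordinary least squares on the log-log curve. The function names `vocabulary_growth` and `fit_heaps` are hypothetical, the input is assumed to be already tokenized, and this is not necessarily the fitting procedure used in the paper.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: distinct tokens seen after the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta).

    Assumption: plain least squares on the log-log curve, chosen for
    illustration only. Needs at least two points with different n.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Applied to the concatenated messages of one forum thread, `fit_heaps(vocabulary_growth(tokens))` yields one (K, β) pair per thread, which could then be compared with the number of posted messages in that thread.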

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kubo, M., Sato, H., Matsubara, T. (2011). Word Familiarity Distributions to Understand Heaps’ Law of Vocabulary Growth of the Internet Forums. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2011. Lecture Notes in Computer Science, vol 6883. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23854-3_66

  • DOI: https://doi.org/10.1007/978-3-642-23854-3_66

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23853-6

  • Online ISBN: 978-3-642-23854-3

  • eBook Packages: Computer Science, Computer Science (R0)
