
Two-layered Blogger identification model integrating profile and instance-based methods

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This paper introduces a two-layered framework that improves authorship identification results for larger numbers of bloggers compared with earlier work. Previous studies fall mainly into two categories: profile-based and instance-based methods, each with its own advantages and limitations. The two-layered framework presented here integrates both approaches and offers a new solution to a key problem in authorship identification, namely the drop in accuracy as the number of authors increases. The paper begins by describing the regular instance-based core model and the features investigated. It then introduces a new psycholinguistic profile representation of authors, presents the extraction of similarity groups over these profiles, and applies the two-layered approach to blogger identification. The results confirm that the proposed two-layered approach improves on our regular classifier, as well as on a selected baseline, for an extended number of users.
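The two-layered idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the grouping algorithm, or the classifier, so everything below is an assumption made for illustration. The feature vectors are toy numbers standing in for psycholinguistic features, profiles are taken to be per-author mean vectors, grouping is a simple greedy radius rule, and the instance-based layer is a 1-nearest-neighbour search. The function names `build_profiles`, `group_profiles`, and `identify` are hypothetical.

```python
from collections import defaultdict
from math import dist


def build_profiles(posts):
    """Profile-based view: one mean feature vector per author (assumed form)."""
    by_author = defaultdict(list)
    for author, vec in posts:
        by_author[author].append(vec)
    return {
        author: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for author, vecs in by_author.items()
    }


def group_profiles(profiles, radius):
    """Greedy similarity grouping (a stand-in technique): an author joins the
    first group whose seed profile lies within `radius`, else starts a new group."""
    groups = []  # list of (seed_vector, [author names])
    for author in sorted(profiles):
        vec = profiles[author]
        for seed, members in groups:
            if dist(seed, vec) <= radius:
                members.append(author)
                break
        else:
            groups.append((vec, [author]))
    return groups


def identify(post_vec, posts, groups):
    """Layer 1: route the post to the group with the nearest seed profile.
    Layer 2: 1-nearest-neighbour over training posts of that group's authors."""
    _, members = min(groups, key=lambda g: dist(g[0], post_vec))
    candidates = [(a, v) for a, v in posts if a in members]
    author, _ = min(candidates, key=lambda av: dist(av[1], post_vec))
    return author


# Toy training data: (author, feature vector). Values are invented.
posts = [
    ("ann", (1.0, 1.0)), ("ann", (1.2, 0.8)),
    ("bob", (1.5, 1.5)), ("bob", (1.7, 1.3)),
    ("cat", (8.0, 8.0)), ("cat", (8.2, 7.8)),
    ("dan", (9.0, 9.0)), ("dan", (9.2, 8.8)),
]

profiles = build_profiles(posts)
groups = group_profiles(profiles, radius=2.0)
print([members for _, members in groups])
print(identify((8.9, 9.1), posts, groups))
```

In this sketch the second layer only has to separate the authors inside one group, which is the intuition behind limiting the accuracy drop as the total number of authors grows. A misrouted post in the first layer cannot be recovered in the second, so the grouping quality matters.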

Author information

Corresponding author

Correspondence to Haytham Mohtasseb.

About this article

Cite this article

Mohtasseb, H., Ahmed, A. Two-layered Blogger identification model integrating profile and instance-based methods. Knowl Inf Syst 31, 1–21 (2012). https://doi.org/10.1007/s10115-011-0398-0

  • DOI: https://doi.org/10.1007/s10115-011-0398-0
