Abstract
This paper introduces a two-layered framework that improves authorship identification results, relative to earlier work, for larger numbers of bloggers. Previous studies fall mainly into two categories, profile-based and instance-based methods, each with its own advantages and limitations. The framework presented here integrates the two approaches and offers a new solution to a key problem in authorship identification: the drop in accuracy as the number of authors increases. The paper begins by describing the regular instance-based core model and the features investigated. It then introduces a new psycholinguistic profile representation of authors, presents the extraction of similarity groups over these profiles, and applies the two-layered approach to blogger identification. The results confirm that the proposed two-layered approach improves on both our regular classifier and a selected baseline for an extended number of users.
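The general idea of combining a profile layer with an instance layer can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the features, the grouping procedure, or the classifier, so everything below is an assumption made for illustration. Here a profile is taken to be the mean feature vector of an author's posts, groups are formed by a greedy distance threshold, and the instance layer is a 1-nearest-neighbour search. The names `build_profiles`, `group_profiles`, `identify`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_profiles(posts):
    """Assumed profile: the mean feature vector over each author's posts."""
    by_author = defaultdict(list)
    for author, vec in posts:
        by_author[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in by_author.items()
    }


def group_profiles(profiles, threshold):
    """Assumed grouping: an author joins the first group whose seed
    profile lies within `threshold`, otherwise starts a new group."""
    groups = []  # list of (seed_profile, member_authors)
    for author in sorted(profiles):
        for seed, members in groups:
            if distance(profiles[author], seed) <= threshold:
                members.append(author)
                break
        else:
            groups.append((profiles[author], [author]))
    return groups


def identify(vec, posts, groups):
    """Layer 1 picks the closest group; layer 2 runs 1-NN inside it."""
    _, members = min(groups, key=lambda g: distance(vec, g[0]))
    candidates = [(a, v) for a, v in posts if a in members]
    author, _ = min(candidates, key=lambda p: distance(vec, p[1]))
    return author


# Toy data: (author, feature vector) pairs, purely illustrative.
posts = [
    ("ann", [1.0, 1.0]), ("ann", [1.2, 0.8]),
    ("bob", [1.5, 1.5]), ("bob", [1.7, 1.3]),
    ("cat", [8.0, 8.0]), ("cat", [8.2, 7.8]),
    ("dan", [9.0, 9.0]), ("dan", [9.2, 8.8]),
]
groups = group_profiles(build_profiles(posts), threshold=2.0)
print([members for _, members in groups])
print(identify([1.6, 1.2], posts, groups))
```

In this sketch the first layer narrows the candidate set to a group of similar authors, so the second layer only has to discriminate among a few authors rather than all of them, which is the intuition behind addressing the accuracy drop as the number of authors grows.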
Cite this article
Mohtasseb, H., Ahmed, A. Two-layered Blogger identification model integrating profile and instance-based methods. Knowl Inf Syst 31, 1–21 (2012). https://doi.org/10.1007/s10115-011-0398-0
DOI: https://doi.org/10.1007/s10115-011-0398-0