Two-layered Blogger identification model integrating profile and instance-based methods

Mohtasseb, Haytham; Ahmed, Amr

doi:10.1007/s10115-011-0398-0

Regular Paper
Published: 20 April 2011

Volume 31, pages 1–21, (2012)

Knowledge and Information Systems

Haytham Mohtasseb¹ & Amr Ahmed¹

Abstract

This paper introduces a two-layered framework that improves authorship identification results for larger numbers of bloggers than earlier work handled. Previous studies fall mainly into two categories: profile-based and instance-based methods. Each approach has its advantages and limitations. The two-layered framework presented here integrates the two approaches and offers a new solution to a key problem in authorship identification, namely the drop in accuracy as the number of authors increases. The paper begins by describing the regular instance-based core model and the features investigated. It then introduces a new psycholinguistic profile representation of authors, presents the extraction of similarity groups over these profiles, and applies blogger identification using the two-layered approach. The results confirm that the proposed two-layered approach improves on our regular classifier, as well as on a selected baseline, for an extended number of users.
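The two-layer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the features, the grouping algorithm, or the classifier, so everything below is an assumption made for illustration. Posts are represented by hypothetical numeric feature vectors, an author's profile is taken to be the mean of that author's vectors, similarity grouping is a simple greedy threshold rule, and the instance-based layer is a 1-nearest-neighbour search restricted to the selected group.

```python
from collections import defaultdict
from math import dist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def build_profiles(posts):
    """Map each author to a profile: the centroid of that author's post vectors."""
    by_author = defaultdict(list)
    for author, vec in posts:
        by_author[author].append(vec)
    return {author: centroid(vecs) for author, vecs in by_author.items()}


def group_authors(profiles, threshold):
    """Greedy grouping (an assumed stand-in for the paper's similarity grouping):
    an author joins the first group whose centroid is within `threshold`,
    otherwise the author starts a new group."""
    groups = []
    for author, profile in profiles.items():
        for group in groups:
            if dist(profile, centroid([profiles[a] for a in group])) <= threshold:
                group.append(author)
                break
        else:
            groups.append([author])
    return groups


def identify(post_vec, posts, profiles, groups):
    """Layer 1: pick the group whose profile centroid is nearest to the post.
    Layer 2: 1-nearest-neighbour over the posts of that group's authors only."""
    best_group = min(
        groups,
        key=lambda g: dist(post_vec, centroid([profiles[a] for a in g])),
    )
    members = set(best_group)
    candidates = [(a, v) for a, v in posts if a in members]
    author, _ = min(candidates, key=lambda av: dist(post_vec, av[1]))
    return author


# Hypothetical toy data: (author, two-dimensional feature vector) per post.
posts = [
    ("ann", (0.0, 0.1)),
    ("ann", (0.1, 0.0)),
    ("bob", (0.2, 0.2)),
    ("cat", (5.0, 5.1)),
    ("dan", (5.2, 5.0)),
]
profiles = build_profiles(posts)
groups = group_authors(profiles, threshold=1.0)
print(groups)
print(identify((5.1, 5.1), posts, profiles, groups))
```

Restricting the second layer to one group means the instance-based step only has to separate a handful of similar authors, which is the intuition behind countering the accuracy drop as the author count grows. A misassignment in the first layer cannot be recovered in the second, so the grouping threshold matters in this sketch.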

Author information

Authors and Affiliations

School of Computer Science, University of Lincoln, Lincoln, LN6 7TS, UK
Haytham Mohtasseb & Amr Ahmed

Corresponding author

Correspondence to Haytham Mohtasseb.

About this article

Cite this article

Mohtasseb, H., Ahmed, A. Two-layered Blogger identification model integrating profile and instance-based methods. Knowl Inf Syst 31, 1–21 (2012). https://doi.org/10.1007/s10115-011-0398-0

Received: 25 January 2010
Revised: 19 November 2010
Accepted: 30 January 2011
Published: 20 April 2011
Issue Date: April 2012
DOI: https://doi.org/10.1007/s10115-011-0398-0
