Abstract
The computational analysis of the style of natural language texts, computational stylistics, seeks to develop automated methods to (1) effectively distinguish texts with one stylistic character from those of another, and (2) give a meaningful representation of the differences between textual styles. Such methods have many potential applications in areas including criminal and national security forensics, customer relations management, spam/scam filtering, and scholarly research. In this chapter, we propose a framework for research in computational stylistics, based on a functional model of the communicative act. We illustrate the utility of this framework via several case studies.
Notes
1. We note that insofar as preferences for certain topics may be intimately related to other aspects of the communicative act that we consider within the purview of style, variation in content variables may be a legitimate and useful object of study. In particular, see the discussion in Sect. 5.4.
2. Metamorphism refers to changes in mineral assemblage and texture in rocks that have been subjected to temperatures and pressures different from those under which they originally formed.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Argamon, S., Koppel, M. (2010). The Rest of the Story: Finding Meaning in Stylistic Variation. In: Argamon, S., Burns, K., Dubnov, S. (eds) The Structure of Style. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12337-5_5
DOI: https://doi.org/10.1007/978-3-642-12337-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12336-8
Online ISBN: 978-3-642-12337-5
eBook Packages: Computer Science, Computer Science (R0)