Abstract
This paper analyzes which stylistic characteristics distinguish styles of writing, specifically different types of A-level computer science articles. To do so, we compared full papers using stylistic feature sets and a supervised machine learning method. We report on the success of this approach in identifying papers from the last 6 years of three conferences: SIGIR, ACL, and AAMAS. The approach achieves accuracies of 95.86%, 97.04%, 93.22%, and 92.14% on four classification experiments: (1) SIGIR / ACL, (2) SIGIR / AAMAS, (3) ACL / AAMAS, and (4) SIGIR / ACL / AAMAS, respectively. The Part of Speech (PoS) and Orthographic feature sets outperformed all others and were found to be key components in distinguishing types of writing.
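To make the idea of classifying by an orthographic feature set concrete, the following is a minimal sketch, not the method used in the paper. The abstract does not list the individual features or name the learning algorithm, so the five ratios below, the nearest-centroid rule, the function names, the labels `venue_a` / `venue_b`, and the toy training sentences are all assumptions chosen for illustration only.

```python
import string


def orthographic_features(text):
    """Return a few length-normalized orthographic ratios (hypothetical feature set)."""
    n_chars = max(len(text), 1)      # avoid division by zero on empty input
    tokens = text.split()
    n_tokens = max(len(tokens), 1)
    return {
        "punct_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "upper_ratio": sum(ch.isupper() for ch in text) / n_chars,
        "capitalized_token_ratio": sum(t[0].isupper() for t in tokens) / n_tokens,
        "paren_ratio": (text.count("(") + text.count(")")) / n_chars,
    }


def centroid(vectors):
    """Average a list of feature dictionaries that share the same keys."""
    keys = vectors[0].keys()
    return {k: sum(v[k] for v in vectors) / len(vectors) for k in keys}


def classify(text, centroids):
    """Assign the label whose centroid is closest in squared Euclidean distance.

    Nearest-centroid is a stand-in for the (unspecified) supervised learner.
    """
    feats = orthographic_features(text)

    def squared_distance(c):
        return sum((feats[k] - c[k]) ** 2 for k in feats)

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy training data (invented for this sketch, not taken from any conference).
train = {
    "venue_a": ["Precision (P) rose to 0.91 (n=40).", "Recall (R) was 0.85 (n=38)."],
    "venue_b": ["agents negotiate and cooperate over time",
                "the agent learns a policy from rewards"],
}
centroids = {
    label: centroid([orthographic_features(t) for t in texts])
    for label, texts in train.items()
}

print(classify("MAP (mean) reached 0.77 (n=12).", centroids))
print(classify("agents coordinate plans with other agents", centroids))
```

A real experiment along these lines would use full papers rather than single sentences, a much richer feature set (including PoS-based features), and a held-out evaluation to estimate accuracy.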
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
HaCohen-Kerner, Y., Rosenfeld, A., Tzidkani, M., Cohen, D.N. (2013). Classifying Papers from Different Computer Science Conferences. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science(), vol 8346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53914-5_45
DOI: https://doi.org/10.1007/978-3-642-53914-5_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53913-8
Online ISBN: 978-3-642-53914-5
eBook Packages: Computer Science, Computer Science (R0)