Intrinsic plagiarism analysis

Stein, Benno; Lipka, Nedim; Prettenhofer, Peter

doi:10.1007/s10579-010-9115-y

Published: 20 January 2010

Language Resources and Evaluation, volume 45, pages 63–82 (2011)

Benno Stein¹,
Nedim Lipka¹ &
Peter Prettenhofer¹

Abstract

Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world in which a reference collection is given. This article investigates whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
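
The first three stages of the analysis chain named in contribution (3) can be illustrated with a toy sketch. It is not the authors' implementation: the chunk size, the three style features, the function-word list, the z-score rule used as a crude stand-in for one-class classification, and all function names are assumptions chosen for brevity. The meta learning (unmasking) stage is omitted.

```python
import re
import statistics

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}


def chunk(text, size=50):
    """Document chunking: consecutive blocks of `size` words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words):
    """Toy style model: mean word length, function-word ratio, type-token ratio."""
    n = len(words)
    return [
        sum(len(w) for w in words) / n,
        sum(w in FUNCTION_WORDS for w in words) / n,
        len(set(words)) / n,
    ]


def outlier_chunks(text, size=50, threshold=2.0):
    """Crude stand-in for one-class classification: flag chunks whose value in
    any style dimension lies more than `threshold` standard deviations from
    the document-wide mean of that dimension."""
    vectors = [style_vector(c) for c in chunk(text, size)]
    if not vectors:
        return []
    flagged = set()
    for dim in range(len(vectors[0])):
        column = [v[dim] for v in vectors]
        mean = statistics.mean(column)
        spread = statistics.pstdev(column)
        if spread == 0:
            continue  # every chunk agrees on this feature
        for index, value in enumerate(column):
            if abs(value - mean) / spread > threshold:
                flagged.add(index)
    return sorted(flagged)
```

With the population standard deviation, a single deviating chunk among n chunks can lie at most sqrt(n - 1) standard deviations from the mean, so with the default threshold of 2.0 this sketch cannot flag anything in a document of fewer than six chunks.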

Notes

The reduction \(\le_{tt}^{p}\) is in \(O(|d|^2)\); within this time all possible outliers can be constructed for a document d. The reduction \(\le_{tt}^{p}\) computes the answer to AVfind from the m answers to AVoutlier by means of a truth table tt, which is a disjunction here.
Function words and stop words are not disjoint sets: most function words are in fact stop words; however, the converse does not hold.
The corpus can be downloaded at http://www.webis.de/research/corpora.
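
One possible reading of the first note is sketched below. It is an interpretation, not taken from the article. It assumes that a candidate outlier is a contiguous section of d, which gives O(|d|²) candidates, and that AVfind answers "yes" exactly when AVoutlier answers "yes" for at least one candidate, which is the disjunction. The names `candidate_sections`, `av_find` and the oracle parameter `av_outlier` are hypothetical.

```python
from typing import Callable, Iterator, Sequence, Tuple


def candidate_sections(n: int) -> Iterator[Tuple[int, int]]:
    """Enumerate all contiguous sections [start, end) of a document of length n.
    There are n * (n + 1) / 2 of them, i.e. O(n^2)."""
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield start, end


def av_find(d: Sequence, av_outlier: Callable[[Sequence, int, int], bool]) -> bool:
    """Truth-table reduction with a disjunction: the answer is True
    if the (assumed) outlier oracle accepts at least one candidate section."""
    return any(av_outlier(d, start, end)
               for start, end in candidate_sections(len(d)))
```

Here `any` plays the role of the truth table: it combines the m individual answers into a single one and stops at the first positive answer.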

References

Argamon, S., Šarić, M., & Stein, S. S. (2003). Style mining of electronic messages for multiple authorship discrimination: First results. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 475–480). New York, NY, USA: ACM. ISBN 1-58113-737-0. doi:10.1145/956750.956805.
Bernstein, Y., & Zobel, J. (2004). A scalable system for identifying co-derivative documents. In A. Apostolico & M. Melucci (Eds.), Proceedings of the string processing and information retrieval symposium (SPIRE) (pp. 55–67). Padova, Italy: Springer. Published as LNCS 3246.
Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95 (pp. 398–409). New York, NY, USA: ACM Press. ISBN 0-89791-731-6.
Broder, A. Z., Eiron, N., Fontoura, M., Herscovici, M., Lempel, R., McPherson, J., et al. (2006). Indexing shared content in information retrieval systems. In EDBT ’06 (pp. 313–330).
Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale–Chall readability formula. Cambridge, MA: Brookline Books.
Chaski, C. E. (2005). Who’s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4(1), 1–14.
Chawla, N. V., Bowyer, K. W., & Kegelmeyer, P. W. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the first conference on North American chapter of the association for computational linguistics (pp. 26–33). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 11–20.
Finkel, R. A., Zaslavsky, A., Monostori, K., & Schmidt, H. (2002). Signature extraction for overlap detection in documents. In Proceedings of the 25th Australian conference on Computer science (pp. 59–64). Australian Computer Society, Inc. ISBN 0-909925-82-8.
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32, 221–233.
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of the 25th VLDB conference Edinburgh, Scotland (pp. 518–529).
Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting a document by stylistic character. Natural Language Engineering, 11(4), 397–415. Supersedes August 2003 workshop version.
Gunning, R. (1952). The technique of clear writing. New York: McGraw-Hill.
Henzinger, M. (2006). Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 284–291). New York, NY, USA: ACM Press. ISBN 1-59593-369-7. doi:10.1145/1148170.1148222.
Hilton, M. L., & Holmes, D. I. (1993). An assessment of cumulative sum charts for authorship attribution. Literary and Linguistic Computing, 8(2), 73–80.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.
Hoad, T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215.
Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111–117. doi:10.1093/llc/13.3.111.
Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2), 172–177.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbor—Towards removing the curse of dimensionality. In Proceedings of the 30th symposium on theory of computing (pp. 604–613).
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334. ISSN 1554-0669. doi:10.1561/1500000005.
Kacmarcik, G., & Gamon, M. (2006). Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL on main conference poster sessions (pp. 444–451). Morristown, NJ, USA: Association for Computational Linguistics.
Kincaid, J., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Research branch report 8–75. Millington TN: Naval Technical Training US Naval Air Station.
Kjell, B., Woods Addison, W., & Frieder, O. (1994). Discrimination of authorship using visualization. Information Processing and Management, 30(1), 141–150. ISSN 0306-4573. doi:10.1016/0306-4573(94)90029-9.
Kleinberg, J. (1997). Two algorithms for nearest-neighbor search in high dimensions. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on theory of computing.
Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of IJCAI’03 workshop on computational approaches to style analysis and synthesis. Mexico: Acapulco.
Koppel, M., & Schler, J. (2004a). Authorship verification as a one-class classification problem. In ICML ’04: Proceedings of the twenty-first international conference on machine learning (p. 62). New York, NY, USA: ACM. ISBN 1-58113-828-5. doi:10.1145/1015330.1015448.
Koppel, M., & Schler, J. (2004b). Authorship verification as a one-class classification problem. In Proceedings of the 21st international conference on machine learning. Banff, Canada: ACM Press.
Koppel, M., Schler, J., Argamon, S., & Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 659–660). New York, NY, USA: ACM. ISBN 1-59593-369-7. doi:10.1145/1148170.1148304.
Koppel, M., Schler, J., & Bonchek-Dokow, E. (2007). Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8, 1261–1276. ISSN 1533-7928.
Koppel, M., Schler, J., & Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9–26.
Malyutov, M. B. (2006). Authorship attribution of texts: A review. Lecture Notes in Computer Science, 2063, 362–380.
Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.
Mansfield, J. S. (2004). Textbook plagiarism in psy101 general psychology: incidence and prevention. In Proceedings of the 18th annual conference on undergraduate teaching of psychology: Ideas and innovations. New York, USA: SUNY Farmingdale.
Meyer zu Eissen, S., & Stein, B. (2004). Genre classification of web pages: User study and feasibility analysis. In S. Biundo, T. Frühwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence, vol. 3228 of Lecture Notes in Artificial Intelligence (pp. 256–269). Berlin Heidelberg New York: Springer. ISSN 0302-9743.
Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic plagiarism detection. In M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Proceedings of the European conference on information retrieval (ECIR 2006), vol. 3936 of Lecture Notes in Computer Science (pp. 565–569). New York: Springer. ISBN 3-540-33347-9.
Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. J. Lenz (Eds.), Advances in data analysis (pp. 359–366). New York: Springer. ISBN 978-3-540-70980-0.
Morton, A. Q., & Michaelson, S. (1990). The qsum plot. Technical report, University of Edinburgh.
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: Federalist papers. Reading, MA: Addison-Wesley Educational Publishers Inc. ISBN 0201048655.
Pavelec, D., Oliveira, L. S., Justino, E. J. R., & Batista, L. V. (2008). Using conjunctions and adverbs for author verification. Journal of UCS, 14(18), 2967–2981.
Potthast, M., Eiselt, A., Stein, B., Barrón Cedeño, A., & Rosso, P. (Eds.). (2009). Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia. PAN Plagiarism Corpus 2009 (PAN-PC-09). http://www.webis.de/research/corpora.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199. ISSN 0162-8828. doi:10.1109/TPAMI.2002.1033211.
Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. Ph.D. thesis, University of Pennsylvania.
Rudman, J. (1997). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351–365.
Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice-Hall.
Sanderson, C., & Guenter, S. (2006a). On authorship attribution via markov chains and sequence kernels. In Pattern recognition, 2006. ICPR 2006. 18th international conference on (vol. 3, pp. 437–440). doi:10.1109/ICPR.2006.899.
Sanderson, C., & Guenter, S. (2006b). Short text authorship attribution via sequence kernels, markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 482–491). URL http://acl.ldc.upenn.edu/W/W06/W06-1657.pdf.
Stamatatos, E. (2007). Author identification using imbalanced and limited training texts. In A. M. Tjoa & R. R. Wagner (Eds.), 18th international conference on database and expert systems applications (DEXA 07) (pp. 237–241). IEEE, September 2007. ISBN 0-7695-2932-1. doi: 10.1109/DEXA.2007.37.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538–556. ISSN 1532-2882. doi:10.1002/asi.v60:3.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35, 193–214.
Stefik, M. (1995). Introduction to knowledge systems. San Mateo, CA, USA: Morgan Kaufmann.
Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In K. Tochtermann & H. Maurer (Eds.), Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science (pp. 572–579). Know-Center.
Stein, B. (2007). Principles of hash-based text retrieval. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th annual international ACM SIGIR conference (pp. 527–534). ACM, July 2007. ISBN 978-1-59593-597-7.
Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In B. Stein, M. Koppel, & E. Stamatatos (Eds.), SIGIR workshop on plagiarism analysis, authorship identification, and near-duplicate detection (PAN 07) (pp. 45–50). CEUR-WS.org, July 2007. URL http://ceur-ws.org/Vol-276.
Stein, B., & Meyer zu Eissen, S. (2007). Topic-identifikation: Formalisierung, analyse und neue Verfahren. KI—Künstliche Intelligenz, 3, 16–22. ISSN 0933-1875. URL http://www.kuenstliche-intelligenz.de/index.php?id=7758.
Stein, B., Lipka, N., & Meyer zu Eissen, S. (2008). Meta analysis within authorship verification. In A. M. Tjoa & R. R. Wagner (Eds.), 19th international conference on database and expert systems applications (DEXA 08) (pp. 34–39). IEEE, September 2008. ISBN 978-0-7695-3299-8. doi:10.1109/DEXA.2008.20.
Surdulescu, R. (2004). Verifying authorship. Final project report CS391L, University of Texas at Austin.
Tax, D. M. J. (2001). One-class classification. Ph.D. thesis, Technische Universiteit Delft.
Tax, D. M. J., & Duin, R. P. W. (2001). Combining one-class classifiers. In Proceedings of the second international workshop on multiple classifier systems (pp. 299–308). New York: Springer. ISBN 3-540-42284-6.
Tweedie, F. J., & Baayen, H. R. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352. doi:10.1023/A:1001749303137.
van Halteren, H. (2004). Linguistic profiling for author recognition and verification. In ACL ’04: Proceedings of the 42nd annual meeting on association for computational linguistics (p. 199). Morristown, NJ, USA: Association for Computational Linguistics. doi:10.3115/1218955.1218981.
van Halteren, H. (2007). Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing, 4(1), 1. ISSN 1550-4875. doi:10.1145/1187415.1187416.
Yang, H., & Callan, J. P. (2006). Near-duplicate detection by instance-level constrained clustering. In E. N. Efthimiadis, S. Dumais, D. Hawking, & K. Järvelin (Eds.), SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 421–428). ISBN 1-59593-369-7.
Yule, G. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393. doi:10.1002/asi.20316.
Zipf, G. K. (1932). Selective studies and the principle of relative frequency in language.

Author information

Authors and Affiliations

Faculty of Media, Media Systems, Bauhaus-Universität Weimar, 99421, Weimar, Germany
Benno Stein, Nedim Lipka & Peter Prettenhofer

Corresponding author

Correspondence to Benno Stein.

About this article

Cite this article

Stein, B., Lipka, N. & Prettenhofer, P. Intrinsic plagiarism analysis. Lang Resources & Evaluation 45, 63–82 (2011). https://doi.org/10.1007/s10579-010-9115-y

Published: 20 January 2010
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10579-010-9115-y
