Classifying news versus opinions in newspapers: Linguistic features for domain independence

K. R. KRÜGER; A. LUKOWIAK; J. SONNTAG; S. WARZECHA; M. STEDE

doi:10.1017/S1351324917000043

Published online by Cambridge University Press: 21 February 2017

Affiliation (all authors): University of Potsdam, FSP Cognitive Science, Applied Computational Linguistics, Karl-Liebknecht-Straße 24-25, 14476 Potsdam, Germany
E-mail: katarina.krueger@uni-potsdam.de, anna.lukowiak@uni-potsdam.de, jonathan.sonntag@uni-potsdam.de, saskia.warzecha@retresco.de, stede@uni-potsdam.de

Abstract

Newspaper text can be broadly divided into the classes ‘opinion’ (editorials, commentary, letters to the editor) and ‘neutral’ (reports). We describe a classification system for performing this separation, which uses a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against a change of newspaper or domain.

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 5, September 2017, pp. 687–707

DOI: https://doi.org/10.1017/S1351324917000043
Copyright: Copyright © Cambridge University Press 2017
