Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures

Evert, Stefan; Heinrich, Philipp; Henselmann, Klaus; Rabenstein, Ulrich; Scherr, Elisabeth; Schmitt, Martin; Schröder, Lutz

doi:10.1007/s10849-019-09283-6

Published: 28 February 2019

Volume 28, pages 309–330, (2019)

Journal of Logic, Language and Information

Stefan Evert¹,
Philipp Heinrich ORCID: orcid.org/0000-0002-4785-9205¹,
Klaus Henselmann²,
Ulrich Rabenstein³,
Elisabeth Scherr²,
Martin Schmitt⁴ &
Lutz Schröder³

Abstract

We investigate an approach to improving statistical text classification by combining machine learners with an ontology-based identification of domain-specific topic categories. We apply this approach to ad hoc disclosures by public companies. This form of obligatory publicity concerns all information that might affect the stock price; relevant topic categories are governed by stringent regulations. Our goal is to classify disclosures according to their effect on stock prices (negative, neutral, positive). In the study reported here, we combine natural language parsing with a formal background ontology to recognize disclosures concerning particular topics from a prescribed list. The semantic analysis identifies some of these topics with reasonable accuracy. We then demonstrate that machine learners benefit from the additional ontology-based information when predicting the cumulative abnormal return attributed to the disclosure at hand.
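
One simple way to hand ontology-based information to a machine learner is to append an indicator for the detected topic category to an ordinary bag-of-words representation. The following is a minimal sketch under that assumption, not the authors' implementation: the category names, the `ONTO=` feature prefix, and the whitespace tokenizer (standing in for lemmatization) are all hypothetical.

```python
from collections import Counter

# Hypothetical topic labels; the prescribed list used in the paper is not reproduced here.
ONTOLOGY_CATEGORIES = ["retirement", "other"]


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a stand-in for lemma counts)."""
    return Counter(text.lower().split())


def add_ontology_feature(features, category):
    """Return a copy of the feature dict with one indicator for the detected category."""
    if category not in ONTOLOGY_CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    augmented = dict(features)
    # Assumed naming scheme: a single binary feature per detected category.
    augmented[f"ONTO={category}"] = 1
    return augmented


text = "The chairman will retire from the board"
features = add_ontology_feature(bag_of_words(text), "retirement")
print(features)
```

A dictionary of this shape can be passed to any learner that accepts sparse named features; the indicator sits alongside the word counts without changing them.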

Notes

See the guideline issued by the Federal Financial Supervisory Authority (BaFin) (2009) for a list of possible price-sensitive events.
German law requires the material event disclosures to be in German, in another accepted language or in English depending on specific criteria.
The Composite DAX (CDAX) is a stock market index based on German stocks that are listed in the General Standard or Prime Standard market segments, see http://deutsche-boerse.com/dbg-en/about-us/services/know-how/glossary/glossary-article/CDAX/2560202.
Non-trading days are excluded.
That is to say: all categories contain equal numbers of disclosures in each fold of the cross-validation.
The trading strategy rests on the assumption that we can buy or sell the shares after the material event and thus indeed collect the net gain of CAR values.
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
Synsets are sets of synonyms representing a lexical semantic concept or word sense.
ABox and TBox are the assertion and terminological components of the ontology, respectively.
The universal validity of these axioms may be debatable but since OWL does not incorporate default reasoning, there appears to be no realistic way to ensure stricter accuracy.
161 of the 178 true retirement messages were detected correctly by the algorithm (true positives) while 5 disclosures were incorrectly marked as retirements (false positives).
As indicated in Sect. 2.2, we only report results for MaxEnt here. Other ML classifiers perform similarly.
The setting with the single feature is omitted here because its lemma feature weights are almost identical to the vanilla bag of words. Recall that we just add a single feature indicating the ontological category in this feature set, so it cannot account for the differences in language use between retirements and other messages that we are interested in here.
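
The detection counts given in the notes above for retirement messages (161 of 178 true retirements found, 5 false positives) can be turned into the usual precision, recall and F1 figures. The short calculation below is only a worked illustration of that arithmetic; the function name is a hypothetical helper, not something taken from the paper.

```python
def precision_recall_f1(true_positives, false_positives, actual_positives):
    """Compute precision, recall and F1 from raw detection counts."""
    # Precision: share of flagged disclosures that really belong to the category.
    precision = true_positives / (true_positives + false_positives)
    # Recall: share of true category members that were flagged.
    recall = true_positives / actual_positives
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported in the notes: 161 true positives, 5 false positives, 178 true retirements.
precision, recall, f1 = precision_recall_f1(161, 5, 178)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these counts the script prints `precision=0.970 recall=0.904 F1=0.936`, which implies 17 missed retirement messages (178 minus 161).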

References

Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the third international conference on computational linguistics and intelligent text processing. Springer, London, UK, CICLing ’02, pp. 136–145. http://dl.acm.org/citation.cfm?id=647344.724142.
Beyer, A., Cohen, D., Lys, T., & Walther, B. (2010). The financial reporting environment: Review of the recent literature. Journal of Accounting and Economics, 50(2–3), 296–343. https://doi.org/10.1016/j.jacceco.2010.10.003.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Bloomfield, R. (2002). The ‘incomplete revelation hypothesis’ and financial reporting. Accounting Horizons, 16, 233–243.
Bollen, J., Mao, H., & Zeng, X. J. (2010). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.
Carter, M., & Soo, B. (1999). The relevance of Form 8-K reports. Journal of Accounting Research, 37, 119–132.
Corrado, C. (2011). Event studies: A methodology review. Accounting and Finance, 51, 207–234.
Ding, X., Zhang, Y., Liu, T., & Duan, J. (2015). Deep learning for event-driven stock prediction. In Proceedings of the twenty-fourth international joint conference on artificial intelligence (IJCAI), pp. 2327–2333. http://ijcai.org/papers15/Papers/IJCAI15-329.pdf.
Evert, S., Proisl, T., Greiner, P., & Kabashi, B. (2014). SentiKLUE: Updating a polarity classifier in 48 hours. In Nakov, P., & Zesch, T. (Eds.), Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, pp. 551–555. http://www.aclweb.org/anthology/S14-2096.
Federal Financial Supervisory Authority (BaFin). (2009). Issuer guidelines.
Feuerriegel, S., Ratku, A., & Neumann, D. (2015). Which news disclosures matter? News reception compared across topics extracted from the latent Dirichlet allocation. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2564603.
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’05, pp. 363–370. https://doi.org/10.3115/1219840.1219885.
Güttler, A. (2005). Wird die Ad-hoc-Publizität korrekt umgesetzt? Eine empirische Analyse unter Einbezug von Unternehmen des Neuen Marktes. Zeitschrift für betriebswirtschaftliche Forschung, pp. 237–259.
Jegadeesh, N., & Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110, 712–729.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028.
Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., & Jurafsky, D. (2013). Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4), 885–916. https://doi.org/10.1162/COLI_a_00152.
Lerman, A., & Livnat, J. (2010). The new Form 8-K disclosures. Review of Accounting Studies, 15, 752–778.
Lüngen, H., Beißwenger, M., Selzam, B., & Storrer, A. (2012). Modelling and processing wordnets in OWL. In Mehler, A., Kühnberger, K., Lobin, H., Lüngen, H., Storrer, A., & Witt, A. (Eds.), Modeling, learning, and processing of text technological data structures, studies in computational intelligence (Vol. 370, pp. 347–376). Springer.
McWilliams, A., & Siegel, D. (1997). Event studies in management research: Theoretical and empirical issues. Academy of Management Journal, 40, 626–657.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39.
Nini, A. (2015). Multidimensional analysis tagger (version 1.3). http://sites.google.com/site/multidimensionaltagger.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Strong, N. (1992). Modelling abnormal returns: A review article. Journal of Business Finance & Accounting, 19, 533–553.
Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 joint SIGDAT conference on empirical methods in natural language processing and very large corpora: Held in conjunction with the 38th annual meeting of the association for computational linguistics—volume 13. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’00, pp. 63–70. https://doi.org/10.3115/1117794.1117802.
Verchow, T. (2011). Ad-hoc-Publizität und Kapitalmarkteffizienz: Eine Untersuchung basierend auf der Textanalyse von Ad-hoc-Mitteilungen. PhD thesis, Ulm University, Faculty of Mathematics and Economics.

Author information

Authors and Affiliations

Department of German and Comparative Studies, FAU Erlangen-Nürnberg, Bismarckstr. 6, 91054, Erlangen, Germany
Stefan Evert & Philipp Heinrich
School of Business and Economics, FAU Erlangen-Nürnberg, Erlangen, Germany
Klaus Henselmann & Elisabeth Scherr
Department of Computer Science, FAU Erlangen-Nürnberg, Erlangen, Germany
Ulrich Rabenstein & Lutz Schröder
Center for Information and Language Processing, LMU München, Munich, Germany
Martin Schmitt

Corresponding author

Correspondence to Philipp Heinrich.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Evert, S., Heinrich, P., Henselmann, K. et al. Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures. J of Log Lang and Inf 28, 309–330 (2019). https://doi.org/10.1007/s10849-019-09283-6

Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s10849-019-09283-6
