Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures

Journal of Logic, Language and Information

Abstract

We investigate an approach to improving statistical text classification by combining machine learners with an ontology-based identification of domain-specific topic categories. We apply this approach to ad hoc disclosures by public companies. This form of obligatory publicity concerns all information that might affect the stock price; relevant topic categories are governed by stringent regulations. Our goal is to classify disclosures according to their effect on stock prices (negative, neutral, positive). In the study reported here, we combine natural language parsing with a formal background ontology to recognize disclosures concerning particular topics from a prescribed list. The semantic analysis identifies some of these topics with reasonable accuracy. We then demonstrate that machine learners benefit from the additional ontology-based information when predicting the cumulative abnormal return attributed to the disclosure at hand.
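
One simple way to hand ontology-based information to a statistical learner is to append an indicator for the recognized topic category to a bag-of-words feature map. The following is a minimal sketch under that assumption, not the pipeline used in the study: the feature name, the whitespace tokenizer, and the example sentence are hypothetical, and no lemmatization or parsing is performed.

```python
from collections import Counter

# Hypothetical name for the extra feature; not taken from the article.
ONTOLOGY_FEATURE = "__ONTO_RETIREMENT__"


def bag_of_words(text):
    """Count lower-cased whitespace tokens (no lemmatization)."""
    return Counter(text.lower().split())


def add_ontology_feature(features, is_category_member):
    """Return a copy of the feature map with one extra binary indicator."""
    enriched = Counter(features)
    if is_category_member:
        enriched[ONTOLOGY_FEATURE] = 1
    return enriched


# Toy disclosure text, invented for illustration only.
text = "The CEO will retire and the board will appoint a successor"
features = add_ontology_feature(bag_of_words(text), is_category_member=True)
```

The resulting feature map could then be vectorized and passed to any classifier; the learner itself is deliberately left out of this sketch.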

Notes

  1. See the guideline issued by the Federal Financial Supervisory Authority (BaFin) (2009) for a list of possible price-sensitive events.

  2. German law requires the material event disclosures to be in German, in another accepted language or in English depending on specific criteria.

  3. The Composite DAX (CDAX) is a stock market index based on German stocks that are listed in the General Standard or Prime Standard market segments, see http://deutsche-boerse.com/dbg-en/about-us/services/know-how/glossary/glossary-article/CDAX/2560202.

  4. Non-trading days are excluded.

  5. That is to say: all categories contain equal numbers of disclosures in each fold of the cross-validation.

  6. The trading strategy rests on the assumption that we can buy or sell the shares after the material event and thus indeed collect the net gain of CAR values.

  7. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

  8. Synsets are sets of synonyms representing a lexical semantic concept or word sense.

  9. ABox and TBox are the assertion and terminological components of the ontology, respectively.

  10. The universal validity of these axioms may be debatable but since OWL does not incorporate default reasoning, there appears to be no realistic way to ensure stricter accuracy.

  11. 161 of the 178 true retirement messages were detected correctly by the algorithm (true positives) while 5 disclosures were incorrectly marked as retirements (false positives).

  12. As indicated in Sect. 2.2, we only report results for MaxEnt here. Other ML classifiers perform similarly.

  13. The setting with the single feature is omitted here because its lemma feature weights are almost identical to the vanilla bag of words. Recall that we just add a single feature indicating the ontological category in this feature set, so it cannot account for the differences in language use between retirements and other messages that we are interested in here.
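
The counts given in note 11 (161 true positives out of 178 true retirement messages, and 5 false positives) determine precision and recall for the retirement category. The following worked example only redoes that arithmetic; the function name is hypothetical, and the F1 score is added for convenience and is not a figure quoted in the notes.

```python
def precision_recall_f1(true_positives, false_positives, actual_positives):
    """Derive precision, recall, F1 and the number of misses from raw counts."""
    false_negatives = actual_positives - true_positives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / actual_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, false_negatives


# Counts taken from note 11.
precision, recall, f1, missed = precision_recall_f1(161, 5, 178)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} missed={missed}")
# prints: precision=0.970 recall=0.904 F1=0.936 missed=17
```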

References

  • Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the third international conference on computational linguistics and intelligent text processing. Springer, London, UK, CICLing ’02, pp. 136–145. http://dl.acm.org/citation.cfm?id=647344.724142.

  • Beyer, A., Cohen, D., Lys, T., & Walther, B. (2010). The financial reporting environment: Review of the recent literature. Journal of Accounting and Economics, 50(2–3), 296–343. https://doi.org/10.1016/j.jacceco.2010.10.003.

  • Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

  • Bloomfield, R. (2002). The ‘incomplete revelation hypothesis’ and financial reporting. Accounting Horizons, 16, 233–243.

  • Bollen, J., Mao, H., & Zeng, X. J. (2010). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.

  • Carter, M., & Soo, B. (1999). The relevance of Form 8-K reports. Journal of Accounting Research, 37, 119–132.

  • Corrado, C. (2011). Event studies: A methodology review. Accounting and Finance, 51, 207–234.

  • Ding, X., Zhang, Y., Liu, T., & Duan, J. (2015). Deep learning for event-driven stock prediction. In Proceedings of the twenty-fourth international joint conference on artificial intelligence (IJCAI), pp. 2327–2333. http://ijcai.org/papers15/Papers/IJCAI15-329.pdf.

  • Evert, S., Proisl, T., Greiner, P., & Kabashi, B. (2014). SentiKLUE: Updating a polarity classifier in 48 hours. In Nakov, P., Zesch, T. (eds) Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, pp. 551–555. http://www.aclweb.org/anthology/S14-2096.

  • Federal Financial Supervisory Authority (BaFin) (2009). Issuer guidelines.

  • Feuerriegel, S., Ratku, A., & Neumann, D. (2015). Which news disclosures matter? News reception compared across topics extracted from the latent Dirichlet allocation. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2564603.

  • Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL’05, pp. 363–370. https://doi.org/10.3115/1219840.1219885.

  • Güttler, A. (2005). Wird die ad-hoc-publizität korrekt umgesetzt? Eine empirische analyse unter einbezug von unternehmen des neuen marktes. Zeitschrift für betriebswirtschaftliche forschung, pp. 237–259.

  • Jegadeesh, N., & Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110, 712–729.

  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028.

  • Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., & Jurafsky, D. (2013). Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4), 885–916. https://doi.org/10.1162/COLI_a_00152.

  • Lerman, A., & Livnat, J. (2010). The new Form 8-K disclosures. Review of Accounting Studies, 15, 752–778.

  • Lüngen, H., Beißwenger, M., Selzam, B., & Storrer, A. (2012). Modelling and processing wordnets in OWL. In Mehler, A., Kühnberger, K., Lobin, H., Lüngen, H., Storrer, A., & Witt, A. (Eds.), Modeling, learning, and processing of text technological data structures, studies in computational intelligence (Vol. 370, pp. 347–376). Springer.

  • McWilliams, A., & Siegel, D. (1997). Event studies in management research: Theoretical and empirical issues. Academy of Management Journal, 40, 626–657.

  • Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39.

  • Nini, A. (2015). Multidimensional analysis tagger (version 1.3). http://sites.google.com/site/multidimensionaltagger.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Strong, N. (1992). Modelling abnormal returns: A review article. Journal of Business Finance & Accounting, 19, 533–553.

  • Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 joint SIGDAT conference on empirical methods in natural language processing and very large corpora: Held in conjunction with the 38th annual meeting of the association for computational linguistics—volume 13. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’00, pp. 63–70. https://doi.org/10.3115/1117794.1117802.

  • Verchow, T. (2011). Ad-hoc-Publizität und kapitalmarkteffizienz: Eine untersuchung basierend auf der textanalyse von ad-hoc-mitteilungen. PhD thesis, Ulm University, Faculty of Mathematics and Economics.

Author information

Corresponding author

Correspondence to Philipp Heinrich.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Evert, S., Heinrich, P., Henselmann, K. et al. Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures. J of Log Lang and Inf 28, 309–330 (2019). https://doi.org/10.1007/s10849-019-09283-6

  • DOI: https://doi.org/10.1007/s10849-019-09283-6
