Abstract
We investigate an approach to improving statistical text classification by combining machine learners with an ontology-based identification of domain-specific topic categories. We apply this approach to ad hoc disclosures by public companies. This form of obligatory publicity concerns all information that might affect the stock price; relevant topic categories are governed by stringent regulations. Our goal is to classify disclosures according to their effect on stock prices (negative, neutral, positive). In the study reported here, we combine natural language parsing with a formal background ontology to recognize disclosures concerning particular topics from a prescribed list. The semantic analysis identifies some of these topics with reasonable accuracy. We then demonstrate that machine learners benefit from the additional ontology-based information when predicting the cumulative abnormal return attributed to the disclosure at hand.
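As a minimal sketch of the combination described above, the snippet below adds one ontology-derived topic indicator to a bag-of-lemmas representation. The function name `build_features`, the `lemma=`/`onto=` feature naming, and the toy lemma list are hypothetical illustrations, not the authors' implementation; only the idea of enriching lexical features with an ontological category comes from the text.

```python
from collections import Counter


def build_features(lemmas, ontology_category=None):
    """Return a feature dict: lemma counts plus an optional topic indicator.

    `ontology_category` stands in for the topic label that a semantic
    analysis might assign to a disclosure (hypothetical interface).
    """
    # Plain bag of words: one count feature per lemma.
    features = Counter(f"lemma={lemma}" for lemma in lemmas)
    # Ontology-based information enters as a single binary feature.
    if ontology_category is not None:
        features[f"onto={ontology_category}"] = 1
    return dict(features)


# Toy disclosure, already lemmatized (invented example data).
doc = ["ceo", "resign", "board", "ceo"]
baseline = build_features(doc)
enriched = build_features(doc, ontology_category="retirement")
```

Feature dictionaries of this kind could then be vectorized (for instance with scikit-learn's `DictVectorizer`) and passed to a classifier that predicts the negative, neutral, or positive price effect.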



Notes
See the guideline issued by the Federal Financial Supervisory Authority (BaFin) (2009) for a list of possible price-sensitive events.
German law requires the material event disclosures to be in German, in another accepted language or in English depending on specific criteria.
The Composite DAX (CDAX) is a stock market index based on German stocks that are listed in the General Standard or Prime Standard market segments, see http://deutsche-boerse.com/dbg-en/about-us/services/know-how/glossary/glossary-article/CDAX/2560202.
Non-trading days are excluded.
That is to say: all categories contain equal numbers of disclosures in each fold of the cross-validation.
The trading strategy rests on the assumption that we can buy or sell the shares after the material event and thus indeed collect the net gain of CAR values.
Synsets are sets of synonyms representing a lexical semantic concept or word sense.
ABox and TBox are the assertion and terminological components of the ontology, respectively.
The universal validity of these axioms may be debatable, but since OWL does not incorporate default reasoning, there appears to be no realistic way to ensure stricter accuracy.
161 of the 178 true retirement messages were detected correctly by the algorithm (true positives) while 5 disclosures were incorrectly marked as retirements (false positives).
As indicated in Sect. 2.2, we only report results for MaxEnt here. Other ML classifiers perform similarly.
The setting with the single feature is omitted here because its lemma feature weights are almost identical to the vanilla bag of words. Recall that we just add a single feature indicating the ontological category in this feature set, so it cannot account for the differences in language use between retirements and other messages that we are interested in here.
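The retirement-detection counts given in the notes above (161 of 178 true retirement messages found, 5 false positives) can be turned into precision, recall, and F1 as in the following sketch. The helper name `precision_recall_f1` is an illustrative assumption; the figures are the ones reported in the note.

```python
def precision_recall_f1(true_positives, false_positives, actual_positives):
    """Compute precision, recall and F1 from raw detection counts."""
    # Precision: share of flagged disclosures that are real retirements.
    precision = true_positives / (true_positives + false_positives)
    # Recall: share of real retirement messages that were flagged.
    recall = true_positives / actual_positives
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    true_positives=161, false_positives=5, actual_positives=178
)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: precision=0.970 recall=0.904 F1=0.936
```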
References
Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the third international conference on computational linguistics and intelligent text processing. Springer, London, UK, CICLing ’02, pp. 136–145. http://dl.acm.org/citation.cfm?id=647344.724142.
Beyer, A., Cohen, D., Lys, T., & Walther, B. (2010). The financial reporting environment: Review of the recent literature. Journal of Accounting and Economics, 50(2–3), 296–343. https://doi.org/10.1016/j.jacceco.2010.10.003.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Bloomfield, R. (2002). The ‘incomplete revelation hypothesis’ and financial reporting. Accounting Horizons, 16, 233–243.
Bollen, J., Mao, H., & Zeng, X. J. (2010). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.
Carter, M., & Soo, B. (1999). The relevance of Form 8-K reports. Journal of Accounting Research, 37, 119–132.
Corrado, C. (2011). Event studies: A methodology review. Accounting and Finance, 51, 207–234.
Ding, X., Zhang, Y., Liu, T., & Duan, J. (2015). Deep learning for event-driven stock prediction. In Proceedings of the twenty-fourth international joint conference on artificial intelligence (IJCAI), pp. 2327–2333. http://ijcai.org/papers15/Papers/IJCAI15-329.pdf.
Evert, S., Proisl, T., Greiner, P., & Kabashi, B. (2014). SentiKLUE: Updating a polarity classifier in 48 hours. In Nakov, P., Zesch, T. (eds) Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, pp. 551–555. http://www.aclweb.org/anthology/S14-2096.
Federal Financial Supervisory Authority (BaFin) (2009). Issuer guidelines.
Feuerriegel, S., Ratku, A., & Neumann, D. (2015). Which news disclosures matter? News reception compared across topics extracted from the latent Dirichlet allocation. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2564603.
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’05, pp. 363–370. https://doi.org/10.3115/1219840.1219885.
Güttler, A. (2005). Wird die Ad-hoc-Publizität korrekt umgesetzt? Eine empirische Analyse unter Einbezug von Unternehmen des Neuen Marktes. Zeitschrift für betriebswirtschaftliche Forschung, pp. 237–259.
Jegadeesh, N., & Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110, 712–729.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028.
Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., & Jurafsky, D. (2013). Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4), 885–916. https://doi.org/10.1162/COLI_a_00152.
Lerman, A., & Livnat, J. (2010). The new Form 8-K disclosures. Review of Accounting Studies, 15, 752–778.
Lüngen, H., Beißwenger, M., Selzam, B., & Storrer, A. (2012). Modelling and processing wordnets in OWL. In Mehler, A., Kühnberger, K., Lobin, H., Lüngen, H., Storrer, A., & Witt, A. (Eds.), Modeling, learning, and processing of text technological data structures, studies in computational intelligence (Vol. 370, pp. 347–376). Springer.
McWilliams, A., & Siegel, D. (1997). Event studies in management research: Theoretical and empirical issues. Academy of Management Journal, 40, 626–657.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39.
Nini, A. (2015). Multidimensional analysis tagger (version 1.3). http://sites.google.com/site/multidimensionaltagger.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Strong, N. (1992). Modelling abnormal returns: A review article. Journal of Business Finance & Accounting, 19, 533–553.
Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 joint SIGDAT conference on empirical methods in natural language processing and very large corpora: Held in conjunction with the 38th annual meeting of the association for computational linguistics—volume 13. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’00, pp. 63–70. https://doi.org/10.3115/1117794.1117802.
Verchow, T. (2011). Ad-hoc-Publizität und Kapitalmarkteffizienz: Eine Untersuchung basierend auf der Textanalyse von Ad-hoc-Mitteilungen. PhD thesis, Ulm University, Faculty of Mathematics and Economics.
About this article
Cite this article
Evert, S., Heinrich, P., Henselmann, K. et al. Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures. J of Log Lang and Inf 28, 309–330 (2019). https://doi.org/10.1007/s10849-019-09283-6
DOI: https://doi.org/10.1007/s10849-019-09283-6