We investigate an approach to improving statistical text classification by combining machine learners with an ontology-based identification of domain-specific topic categories. We apply this approach to ad hoc disclosures by public companies. This form of obligatory publicity concerns all information that might affect the stock price; relevant topic categories are governed by stringent regulations. Our goal is to classify disclosures according to their effect on stock prices (negative, neutral, positive). In the study reported here, we combine natural language parsing with a formal background ontology to recognize disclosures concerning particular topics from a prescribed list. The semantic analysis identifies some of these topics with reasonable accuracy. We then demonstrate that machine learners benefit from the additional ontology-based information when predicting the cumulative abnormal return attributed to the disclosure at hand.

See the guideline issued by the Federal Financial Supervisory Authority (BaFin) (2009) for a list of possible price-sensitive events.
German law requires material event disclosures to be published in German, in another accepted language, or in English, depending on specific criteria.
The Composite DAX (CDAX) is a stock market index based on German stocks that are listed in the General Standard or Prime Standard market segments, see http://deutsche-boerse.com/dbg-en/about-us/services/know-how/glossary/glossary-article/CDAX/2560202.
Non-trading days are excluded.
That is to say: all categories contain equal numbers of disclosures in each fold of the cross-validation.
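A minimal sketch of how such balanced folds could be built, assuming the classes are downsampled to a common size and then dealt out round-robin. The function name `balanced_folds`, the seed, and the toy label counts are illustrative assumptions, not the procedure used in the study.

```python
import random
from collections import defaultdict

def balanced_folds(labels, k, seed=0):
    """Return k lists of indices; every fold holds the same number of items per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    # Largest per-class count that every class can supply and that divides evenly by k.
    per_class = (min(len(v) for v in by_label.values()) // k) * k
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Surplus items beyond per_class are dropped (downsampling).
        for pos, idx in enumerate(indices[:per_class]):
            folds[pos % k].append(idx)
    return folds

# Toy data: class sizes are invented for illustration only.
labels = ["negative"] * 12 + ["neutral"] * 30 + ["positive"] * 17
folds = balanced_folds(labels, k=5)
```

With these toy counts, each of the five folds receives two disclosures per category; the surplus neutral and positive items are simply left out.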
The trading strategy rests on the assumption that we can buy or sell the shares after the material event and thus indeed collect the net gain of CAR values.
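One way to read this assumption is sketched below: a long position is opened when a disclosure is predicted positive and a short position when it is predicted negative, while neutral predictions trigger no trade. This trading rule, the function name `net_gain`, and the sample CAR values (in percent) are assumptions made for illustration; transaction costs are ignored.

```python
def net_gain(predictions, car_values):
    """Sum the CAR collected by trading on each predicted class."""
    gain = 0.0
    for predicted, car in zip(predictions, car_values):
        if predicted == "positive":
            gain += car   # long position earns the CAR
        elif predicted == "negative":
            gain -= car   # short position earns the negated CAR
        # "neutral": no trade, no contribution
    return gain

# Invented example values, CAR in percent.
predictions = ["positive", "negative", "neutral", "positive"]
car_values = [2.0, -1.5, 0.5, -3.0]
total = net_gain(predictions, car_values)
```

In this toy example, the correct long and short trades gain 2.0 and 1.5, while the mistaken long trade loses 3.0.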
Synsets are sets of synonyms representing a lexical semantic concept or word sense.
ABox and TBox are the assertion and terminological components of the ontology, respectively.
The universal validity of these axioms may be debatable, but since OWL does not incorporate default reasoning, there appears to be no realistic way to ensure stricter accuracy.
161 of the 178 true retirement messages were detected correctly by the algorithm (true positives) while 5 disclosures were incorrectly marked as retirements (false positives).
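The counts above translate directly into precision and recall. The short calculation below uses only the three reported numbers; the variable names are illustrative.

```python
true_positives = 161      # retirement messages detected correctly
actual_retirements = 178  # all true retirement messages
false_positives = 5       # disclosures wrongly marked as retirements

false_negatives = actual_retirements - true_positives  # missed retirements
precision = true_positives / (true_positives + false_positives)
recall = true_positives / actual_retirements
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: precision=0.970 recall=0.904 F1=0.936
```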
As indicated in Sect. 2.2, we only report results for MaxEnt here. Other ML classifiers perform similarly.
The setting with the single feature is omitted here because its lemma feature weights are almost identical to those of the vanilla bag-of-words setting. Recall that this feature set just adds a single feature indicating the ontological category, so it cannot account for the differences in language use between retirements and other messages that we are interested in here.
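The single-feature setting can be pictured as follows: the lemma counts are left untouched and one extra indicator is appended. The function name `featurize`, the `ONTO=` prefix, and the binary encoding are hypothetical choices for this sketch, not the feature encoding used in the study.

```python
from collections import Counter

def featurize(lemmas, ontology_category=None):
    """Bag-of-words lemma counts, optionally extended by one ontology indicator."""
    features = Counter(lemmas)
    if ontology_category is not None:
        # One additional feature; all lemma counts stay exactly as they were.
        features["ONTO=" + ontology_category] = 1
    return dict(features)

# Invented example lemmas.
lemmas = ["vorstand", "ausscheiden", "vorstand"]
plain = featurize(lemmas)
extended = featurize(lemmas, "retirement")
```

Because the lemma counts are identical in both variants, a classifier can attach weight to the category as a whole but has no way to weight a lemma differently inside and outside retirement messages.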
Evert, S., Heinrich, P., Henselmann, K. et al. Combining Machine Learning and Semantic Features in the Classification of Corporate Disclosures. J of Log Lang and Inf 28, 309–330 (2019).
