
Expert Systems with Applications

Volume 88, 1 December 2017, Pages 270-275

AnnoFin–A hybrid algorithm to annotate financial text

https://doi.org/10.1016/j.eswa.2017.07.016

Highlights

  • AnnoFin helps a user to classify financial text data into ten categories.

  • AnnoFin, when trained with 30% of the data, has an accuracy of 73.56%.

  • The accuracy increases by about 2% for each additional 10% of training data.

Abstract

In this work, we study the problem of annotating a large volume of financial text by learning from a small set of human-annotated training data. The training data is prepared by randomly selecting some text sentences from the large corpus of financial text. Conventionally, a bootstrapping algorithm is used to annotate a large volume of unlabeled data by learning from a small set of annotated data. However, that small set of annotated data has to be carefully chosen as seed data. Our approach therefore departs from conventional bootstrapping in that we let users select the seed data randomly. We show that our proposed algorithm has an accuracy of 73.56% in classifying the financial texts into the different categories (“Accounting”, “Cost”, “Employee”, “Financing”, “Sales”, “Investments”, “Operations”, “Profit”, “Regulations” and “Irrelevant”) even when the training data is just 30% of the total data set. Additionally, the accuracy improves by approximately 2% on average for every 10% increase in training data, and reaches 77.91% when the training data is about 50% of the total data set. Because a dictionary of hand-chosen keywords prepared by domain experts is often used for financial text extraction, we assume the existence of almost linearly separable hyperplanes between the different classes and therefore use a Linear Support Vector Machine along with a modified version of the Label Propagation Algorithm that exploits the notion of neighborhood (in Euclidean space) for classification. We believe that our proposed techniques will be of help to Early Warning Systems used in banks, where large volumes of unstructured text need to be processed for better insights about a company.

Introduction

A lot of financial information is published in the form of annual reports, financial news, outlooks and statements. These sources provide information about a firm’s future prospects, its operations and investments. Any automated tool analyzing this voluminous data must decipher the textual content of the data. Information about a company’s performance, profit and liquidity extracted from annual reports correlates with several performance factors such as stock prices and future risks. This information is often scattered across several pages of unstructured text in the annual report. Researchers such as Das and Chen (2007); Nassirtoussi, Aghabozorgi, Teh, and Ngo (2014); Sun, Belatreche, Coleman, McGinnity, and Li (2014) have already shown that opinions and sentiments extracted from unstructured data are indicators of corporate performance. However, almost all of these previous works focused on classifying sentences into one of four classes such as “positive”, “negative”, “neutral” or “irrelevant”. Li (2010) showed that financial texts can also be classified into more specific categories such as “Accounting”, “Financing”, “Cost”, “Employee”, “Profit” etc. But that work was focused on applying such classifications to forward-looking sentences. Moreover, the approach of Li (2010) needs a large, carefully chosen corpus of human-annotated data for training the system, and annotating a large volume of data by human annotators is a tedious process.

It should be noted that though sentiment analysis is an important aspect of financial text mining, the objective of this work is different from sentiment analysis. As Early Warning Systems (EWS) are being introduced in the retail banking sector in India (with guidelines from the Reserve Bank of India), a lot of emphasis is being placed on developing financial text mining techniques that allow EWS to gather insights from the annual reports of companies. For example, one criterion set by the Reserve Bank of India is to check whether information about high debts owed by a company to a bank has not been disclosed in its annual report. For this purpose, the EWS needs to identify the sentences in the annual report that relate to the company's debt. With such an application in mind, we attempt to study the problem of classifying financial text into specific categories using a small set of randomly chosen seed data.

The accuracy of any classification model depends to a great extent on the amount of training data provided to the model. However, as EWS and financial text classification are relatively new, there are no data sources available in the public domain to train a system for classifying financial text into specific categories. Therefore, we need techniques for annotating financial texts that can start with a small amount of labeled data and annotate a large volume of unlabeled data. In Natural Language Processing (NLP), a technique known as bootstrapping is widely used in which a small set of data is carefully chosen and annotated by a human expert, and a model is then trained on the annotated data set. This model is then used to label the remaining data. The algorithm runs in several rounds: at each round a small set of the unlabeled data is labeled, and the newly labeled data together with the old training data forms the new training set. The reason for carefully selecting a good set of seed examples is to keep the accuracy of the earlier iterations high. In this work, we propose a hybrid algorithm in which financial experts can randomly annotate a small set of the extracted texts for each class and then use our algorithm to label the remaining data. The primary contributions of this paper are: (i) a way to adapt the bootstrapping technique for annotating a large corpus of financial text; (ii) removal of the requirement of carefully annotating good seed examples for each class; (iii) the use of data mining techniques to find good representatives for each class and then to find the unlabeled candidates that are most likely to be labeled correctly at the current round of iteration.

Section snippets

Problem statement

Problem 1

Given a set D = {D_l, D_u}, where D_l and D_u are the subsets of labeled and unlabeled sentences, respectively, extracted from financial documents, the task is to assign labels to the sentences in D_u by learning patterns from D_l. We assume that (i) each observation x_i ∈ D_l has at most one label; (ii) |D_l| = ρ × |D|. The parameter 0 < ρ < 1 is user-defined.

With respect to the problem setting, we assume that the set D is rectangular (tabular) data where each row has a sentence (denoted by x_i) and ρ fraction of
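
To make the problem setting concrete, the following minimal sketch (an illustration only, not code from the paper; the function name and the plain-Python representation of D are assumptions) shows how a data set of sentences could be split into a labeled subset D_l containing a random ρ fraction of the rows and an unlabeled subset D_u:

    # Minimal sketch (assumed helper, not from the paper): split a sentence
    # data set D into a labeled part D_l (a random rho fraction of rows)
    # and an unlabeled part D_u, mirroring the problem statement above.
    import random

    def split_labeled_unlabeled(sentences, labels, rho, seed=0):
        """Return (D_l, D_u): a rho fraction of rows keep their labels, the rest become unlabeled."""
        assert 0.0 < rho < 1.0, "rho must lie strictly between 0 and 1"
        indices = list(range(len(sentences)))
        random.Random(seed).shuffle(indices)
        n_labeled = int(rho * len(sentences))
        labeled_idx = set(indices[:n_labeled])
        D_l = [(sentences[i], labels[i]) for i in sorted(labeled_idx)]
        D_u = [sentences[i] for i in indices if i not in labeled_idx]
        return D_l, D_u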

Prior art

The work that comes closest to ours is that of Li (2010), in which the author tried to classify forward-looking sentences into relevant categories. The author used a Naive Bayesian classification technique and reported an accuracy of 69%. To the best of our knowledge, no other paper in the literature has studied the problem of financial text classification since Li (2010).
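
For comparison purposes, a Naive Bayes sentence classifier of that general kind can be reproduced as a baseline; the sketch below is a generic scikit-learn pipeline with illustrative example sentences, not the configuration reported by Li (2010):

    # Sketch of a Naive Bayes baseline for financial sentence classification
    # (generic scikit-learn pipeline; the example sentences are illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_sentences = ["net profit rose by 12% this quarter",
                       "the board approved a new debt facility"]
    train_labels = ["Profit", "Financing"]

    baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
    baseline.fit(train_sentences, train_labels)
    print(baseline.predict(["operating profit declined sharply"]))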

Bootstrapping method in natural language processing

The bootstrapping method is a weakly supervised classification technique for labeling unlabeled data using a small set of labeled data. The steps of the bootstrapping method are as follows (a minimal code sketch is given after the list):

  • 1.

    Carefully select a small number of sentences as seed data and label the sentences.

  • 2.

    Train a classifier L on the labeled data.

  • 3.

    Label the unlabeled data instances using L. Find the confidence scores of L for the newly labeled instances.

  • 4.

    Add the most confident newly labeled instances to the training data and repeat the process.
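
Written as code, the four steps above form a simple loop. The sketch below is a generic rendering with a linear SVM as the base classifier and its decision scores as the confidence measure; the TF-IDF features, confidence ranking and batch size are assumptions made for illustration, and this is not the AnnoFin algorithm itself:

    # Generic bootstrapping loop over the four steps listed above
    # (illustrative only; the confidence measure and batch size are assumptions).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    def bootstrap(seed_texts, seed_labels, unlabeled_texts, rounds=5, top_k=50):
        texts, labels = list(seed_texts), list(seed_labels)   # step 1: labeled seed data
        pool = list(unlabeled_texts)
        for _ in range(rounds):
            if not pool:
                break
            vec = TfidfVectorizer().fit(texts + pool)
            clf = LinearSVC().fit(vec.transform(texts), labels)      # step 2: train classifier
            scores = clf.decision_function(vec.transform(pool))      # step 3: confidence scores
            conf = np.max(scores, axis=1) if scores.ndim > 1 else np.abs(scores)
            best = np.argsort(conf)[::-1][:top_k]                    # step 4: most confident
            preds = clf.predict(vec.transform(pool))
            texts += [pool[i] for i in best]
            labels += [preds[i] for i in best]
            pool = [s for i, s in enumerate(pool) if i not in set(best)]
        return texts, labels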

Data extraction

Our input data points are text sentences extracted from 732 company annual reports (in portable document format) for 147 companies listed on the Bombay Stock Exchange (BSE) for at least five years. The extraction of the text was an automated process during which the AFINN-111 (Nielsen, 2011) list was used. AFINN-111 is a list of English words rated for sentiment valence, and it was used as a filter to reduce the number of irrelevant extracted sentences. In addition to that, we use
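
To illustrate this kind of valence-based filtering, the sketch below keeps only sentences that contain at least one AFINN-rated word; the local file path and the tab-separated "word, score" format are assumptions about how the AFINN-111 list is stored, not a description of the authors' extraction pipeline:

    # Sketch: filter extracted sentences with the AFINN-111 word list
    # (file path and tab-separated format are assumptions).
    import re

    def load_afinn(path="AFINN-111.txt"):
        valence = {}
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                word, score = line.rstrip("\n").rsplit("\t", 1)
                valence[word] = int(score)
        return valence

    def keep_relevant(sentences, valence):
        """Keep only sentences containing at least one AFINN-rated word."""
        kept = []
        for sentence in sentences:
            tokens = re.findall(r"[a-z']+", sentence.lower())
            if any(token in valence for token in tokens):
                kept.append(sentence)
        return kept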

Experimental results

We have measured the performance of our classification algorithms in terms of accuracy and F-measure. The accuracy of a classifier is defined as the fraction of correct predictions over the testing data set (Zaki & Meira, 2014). The class-specific accuracy (precision) of the classifier for class Cj is the fraction of correct predictions over all points predicted to be in class Cj, while the class-specific coverage (recall) of the classifier for class Cj is the fraction of correct predictions
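
These per-class quantities can be computed directly from predicted and true labels; the short sketch below uses scikit-learn's classification_report on small assumed label lists, where "precision" corresponds to the class-specific accuracy and "recall" to the coverage defined above:

    # Sketch: overall accuracy plus per-class precision (class-specific accuracy)
    # and recall (coverage), computed on assumed example labels.
    from sklearn.metrics import accuracy_score, classification_report

    y_true = ["Profit", "Cost", "Profit", "Financing", "Irrelevant"]
    y_pred = ["Profit", "Cost", "Cost", "Financing", "Irrelevant"]

    print("overall accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, zero_division=0))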

Conclusions and limitations

Conclusions:

  • Conclusion 1: From the tables above, it appears that the assumption of the existence of linear separating hyperplanes is valid, given the performance of the Label Propagation Algorithm with Linear SVM. However, the assumption of the existence of orthogonal hyperplanes (as created by the Decision Tree) is not appropriate.

  • Conclusion 2: The notion of neighborhood improves the performance of the Label Propagation Algorithm in conjunction with Linear SVM, as illustrated in the sketch below.
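
One way to picture the combination referred to in Conclusion 2 is sketched below: labels are propagated from the seed sentences over a k-nearest-neighbour graph with scikit-learn's LabelPropagation, and a Linear SVM is then fitted on the propagated labels. This is an assumed, simplified illustration, not the modified Label Propagation Algorithm proposed in the paper.

    # Illustrative pairing of label propagation and a linear SVM
    # (a simplified sketch; the example sentences and parameters are assumptions).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.semi_supervised import LabelPropagation
    from sklearn.svm import LinearSVC

    texts = ["net profit grew strongly", "profit margin improved",
             "a new loan was raised", "a loan agreement was signed"]
    labels = np.array([0, -1, 1, -1])      # -1 marks unlabeled sentences

    X = TfidfVectorizer().fit_transform(texts).toarray()
    lp = LabelPropagation(kernel="knn", n_neighbors=2).fit(X, labels)
    svm = LinearSVC().fit(X, lp.transduction_)       # train the SVM on propagated labels
    print(lp.transduction_, svm.predict(X))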

Limitations and future work:

  • Limitation 1: The application of

References (14)

  • M. Bawa et al.

    LSH forest: Self-tuning indexes for similarity search

    Proceedings of the 14th international conference on world wide web, WWW 2005, Chiba, Japan, May 10–14, 2005

    (2005)
  • L. Buitinck et al.

    API design for machine learning software: Experiences from the scikit-learn project

    ECML PKDD workshop: Languages for data mining and machine learning

    (2013)
  • S.R. Das et al.

    Yahoo! for Amazon: Sentiment extraction from small talk on the web

    Management Science

    (2007)
  • G. James et al.

    An introduction to statistical learning: With applications in R

    (2014)
  • F. Li

    The information content of forward-looking statements in corporate filings: A naive Bayesian machine learning approach

    Journal of Accounting Research

    (2010)
  • C.D. Manning et al.

    Introduction to information retrieval

    (2008)
There are more references available in the full text version of this article.

Cited by (4)

  • A new graphic kernel method of stock price trend prediction based on financial news semantic and structural similarity

    2019, Expert Systems with Applications
    Citation Excerpt:

    Experiments show that feature selection and feature weighting methods have a substantial role in sentiment classification. Das, Mehta, and Subramaniam (2017) propose AnnoFin, a new method, to help classify financial texts into different categories and a high accuracy of 73.56% is achieved even when the training data is just 30% of the total data set. Zhu and Iglesias (2018) exploit different semantic similarity methods based on various semantic resources, and the experimental results have shown this method is more effective than text similarity methods when contextual information is rare.

  • Machine learning techniques for annotations of large financial text datasets

    2019, 25th Americas Conference on Information Systems, AMCIS 2019