
1 Introduction

Portfolio optimization in stock markets is the process of selecting a subset of assets that maintains a desired trade-off between risk and return [16]. The portfolio selection process consists of finding, in a large collection of stocks, the participation (i.e., the individual proportion) of each stock that minimizes the portfolio’s risk at a given portfolio return, or maximizes the portfolio’s return at a given risk [8]. This topic has been investigated by researchers from many areas, such as optimization, machine learning (ML) and economics. Usually, portfolio selection algorithms use return and risk measures computed from a set of assets to make decisions, and the resulting trade-off between risk and return determines which assets enter the portfolio. Commonly used measures include the mean price return and the price variance.

The correlation between asset prices is also important for portfolio management. It has been used in several works to create a network perspective that characterizes the complex structures in the stock market, also known as stock networks or financial networks [5, 15, 18]. The correlation is so important that some authors suggest the use of topological information derived from financial networks for portfolio management [13, 24, 32]. As an example, we can create portfolios by applying clustering algorithms or centrality measures to stock networks. Despite its importance, we did not find works exploring the prediction of weighted links in financial networks to improve the performance of portfolio optimization algorithms, although we did find works using price and return forecasting to improve portfolio management results [19].

This study proposes a new approach to define the constants of the classic mean-variance analysis (MVA) from [16]. The new measures for expected return and asset correlation are calculated by a new method, ML Weighted Link Prediction Analysis (MLink), which provides information to the mathematical model of Markowitz, often used as the basis for portfolio optimization. MLink was applied to dynamic temporal stock networks to induce predictive models for return forecasting and weighted link prediction. We executed several experiments to assess the performance of our method. The experimental results show that MLink outperforms MVA by over \(56\%\) in accumulated gross return. In addition, MLink has better results than three variant methods that use ARIMA, Median and Mean for return forecasting combined with weighted link prediction, named ARIMA-MLink, Median-MLink and Mean-MLink, respectively.

2 Problem Definition and Related Works

The behavior of assets in stock markets and the interaction between them have been widely studied in the literature [3, 4, 30]. To describe them formally, consider two distinct assets i and j belonging to a set A, and let \(S_i\) and \(S_j\) be the time series related to i and j, respectively. A weighted stock network can be represented by a graph \(G = (V,E)\), where V is the set of assets, \(V \subset A\), and E contains all possible pairs of assets \(i,j \mid i, j \in V\). The relationship between i and j is measured using correlation metrics, assigning a weight w to each edge in E. A time-ordered sequence of weighted stock network graphs represents a Dynamic Temporal Stock Network [10].

In this work, we want to answer the following question: given a set of graphs \(G_{1}, G_{2}, ..., G_{N}\) representing a temporal sequence of weighted stock networks up to time N, and a graph \(G_{N + 1}\) whose link weights were predicted using ML algorithms, can the combination of ML-based weighted link prediction with mathematical models of portfolio optimization improve the trade-off between risk and return?

2.1 Portfolio Optimization Using Stock Network and Forecast

Several works apply ML or optimization algorithms to portfolio management [16, 17]. Despite its relevance, few papers investigate the use of stock network structure for portfolio management. In [24], the authors explore centrality in complex networks to improve the portfolio selection process by targeting a group of stocks belonging to a certain region of the stock market network. The work in [32], like this paper, analyzes time-evolving stock markets using a temporal network representation; its authors also propose a portfolio selection tool using temporal centrality in stock networks. A portfolio optimization method based on network topology, using cross-correlation of daily price returns for the American and Chinese stock markets to create networks, is proposed in [13]. Taking into account the importance of correlation matrices and the possible presence of noise in these matrices, [22] introduces an approach that allows a systematic investigation of the effect of different sources of noise in financial correlations in the portfolio and risk management context. We also analyzed different aspects related to the presence of noise, such as the length of the time series.

Other works use price and return forecasting to improve portfolio management [19]. In [8], a neural network is used to predict future stock returns, and the prediction errors are used as a risk measure. [1] also shows that, for asset allocation decisions, using models able to predict returns is better than using historical averages. This motivated the proposal of the new method, MLink, which uses the prediction of weighted link formation to improve the results of portfolio optimization models. The problem of predicting links in weighted networks is an extension of the link prediction problem, whose main objective is the detection of hidden links or links that will be formed [28]. In the weighted link prediction problem, it is necessary to predict both the link and its weight [14].

3 Methodology

The MLink framework has two steps: (i) ML to predict stock returns and weighted links in dynamic stock networks; (ii) mathematical portfolio optimization model using stock returns and predicted weighted stock networks as input. Figure 1 illustrates this framework.

Fig. 1. Framework of MLink

This section introduces the data set (Sect. 3.1), the weighted link prediction method (Sect. 3.2), return forecast (Sect. 3.3) and the mathematical model (Sect. 3.4).

3.1 Market Data Set

The real data set used in this work was collected from the Brazilian Stock Exchange (BM&F Bovespa) between January 2018 and October 2018. We performed experiments using data on the assets of the Bovespa Index (Ibovespa) (see footnote 1). Because of its dynamic structure, this theoretical portfolio currently has 65 assets, but we used the 56 assets that remained in the Ibovespa list during the entire period. The list of assets used in the experiments is available at the link given in footnote 2. In the first step, our objective is to identify the link weight between all pairs of assets in a complete weighted network. As in traditional portfolio management algorithms, we are interested in reorganizing the portfolio every day. Thus, the data were processed to obtain daily price time series. Note that our approach can be easily adapted to weekly, monthly or intraday strategies, according to the main purpose of the investor. It is important to emphasize that these data include an election period with high stock price variations.

3.2 Weighted Link Prediction in Dynamic Stock Networks

In this work, we use a modified version of the method proposed in [15] to create weighted dynamic financial networks. In this method, the nodes of the graph represent assets and the edges represent the relationship between them, which is based on the correlation of price time series. Let \(S_i\) and \(S_j\) be price time series of length L for two distinct assets i and j. We can transform these non-stationary price time series into stationary return time series using the following equation:

$$\begin{aligned} Y_i(t) = \ln (P_i(t)) - \ln (P_i(t - 1)) \end{aligned}$$

where \(P_i(t) \in S_i\) is the closing price of asset i at day t. In market terms, this transformation yields the logarithmic return. The average of this type of time series tends to be close to zero. Next, we computed the Pearson correlation coefficient [23] between all possible pairs of stock logarithmic return time series \(Y_i\) and \(Y_j\):

$$\begin{aligned} \rho _{ij} = \frac{ cov(Y_i,Y_j)}{ \sqrt{var(Y_i) \cdot var(Y_j)} } \end{aligned}$$
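The two formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy price values are assumptions made for the example and are not taken from the original experiments.

```python
import math

def log_returns(prices):
    """Logarithmic returns ln P(t) - ln P(t-1) of a closing-price series."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

def pearson(y_i, y_j):
    """Pearson correlation cov(Y_i, Y_j) / sqrt(var(Y_i) * var(Y_j))."""
    n = len(y_i)
    mean_i, mean_j = sum(y_i) / n, sum(y_j) / n
    # The 1/n factors of cov and var cancel, so plain sums are enough.
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(y_i, y_j))
    var_i = sum((a - mean_i) ** 2 for a in y_i)
    var_j = sum((b - mean_j) ** 2 for b in y_j)
    return cov / math.sqrt(var_i * var_j)

# Toy closing prices (illustrative values, not market data).
prices_a = [10.0, 10.5, 10.2, 10.8, 11.0]
prices_b = [20.0, 21.0, 20.4, 21.6, 22.0]  # exactly twice prices_a

rho = pearson(log_returns(prices_a), log_returns(prices_b))
```

Because the second series is a constant multiple of the first, both have the same logarithmic returns and the correlation is 1 up to floating-point error. Note that a price series of length L yields L - 1 returns.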

The number of points L used to measure the correlation between all possible pairs of stocks is another aspect to be considered. We performed experiments using \(L = \lbrace 10, 15, 20, 25, 30\rbrace \), as suggested by [31]. The correlation between assets is assigned to the edge weights, and this information is used as input to the mathematical portfolio optimization algorithm. We used the correlation \(\rho _{ij}\) between all possible pairs of stocks present in the Bovespa Index to create an adjacency matrix C. By definition, the elements \(\rho _{ij}\) are in the range of \(-1\) to 1, where \(-1\) corresponds to perfect anti-correlation, 1 corresponds to perfect correlation and 0 corresponds to absence of correlation. C therefore represents a complete undirected stock network.

Optimization models generally use a similar correlation matrix as input to create a subset of assets with a reasonable trade-off between return and risk. In this work, instead of using the known correlation matrix at day t, we propose to use a one-step-ahead correlation matrix to improve the results of portfolio optimization. Our first main problem is to create machine learning (ML) algorithms able to induce models that can predict the weight of all edges in a future dynamic stock network \(\varDelta _{t + 1}\). To this end, we created a method that uses three different sources of features to build models able to predict correlation values, represented by edge weights. In addition, since our networks are undirected, the number of weighted edges for each graph \(G = (V, E)\) is given by \(\vert E \vert = \vert V \vert * ( \vert V \vert - 1) / 2\). Thus, we have \(\vert E \vert = 56 * 55 / 2 = 1540\) values to predict for each day.
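As a small check of the edge count, and to show how one prediction target per undirected edge can be read from a symmetric matrix C, consider the following sketch. The tickers and matrix values are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def edge_count(n_assets):
    """Number of edges of a complete undirected graph: |V| * (|V| - 1) / 2."""
    return n_assets * (n_assets - 1) // 2

def edge_weights(C, assets):
    """Map each unordered asset pair to its weight, read from the upper triangle of C."""
    return {(assets[i], assets[j]): C[i][j]
            for i, j in combinations(range(len(assets)), 2)}

# Hypothetical tickers and correlation values.
assets = ["AAA", "BBB", "CCC"]
C = [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]]

weights = edge_weights(C, assets)
n_targets = edge_count(56)  # number of values to predict per day for 56 assets
```

Only the upper triangle is used because C is symmetric and its diagonal (the correlation of an asset with itself) carries no information.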

In this work, we address the weighted link prediction problem as a regression task, using supervised ML algorithms. We use three sources of features to predict weighted link formation between assets in dynamic stock networks: (i) complex network derived features; (ii) domain derived features; (iii) return time series forecasts. Each example in the data set is labeled with the correlation value between a pair of stocks i and j for a given future period. Next, we present the set of input features used to train the ML algorithms.

Network Derived Features are computed at each iteration using complex network statistical measures. The measures can be divided according to the level of analysis: the node level, where nodes represent assets, and the link level [21]. To create each example for the supervised learning data set, node metrics are included for both nodes i and j, and edge metrics are calculated between nodes i and j [20]. Let \(\vert i \vert \) denote the degree of node i, i.e., its number of edges.

  • Node-Level Derived Features: related to the position of the node within the overall structure of the complex network [21]. The following node metrics presented in Table 1 were used as stock i derived features.

  • Link-Level Derived Features: related to both contents and patterns of edges in complex networks [21]. The following link metrics presented in Table 2 were calculated between i and j stocks.

Table 1. Node-Level Derived Features
Table 2. Link-Level Derived Features

Domain Features are computed each day using a set of Technical Analysis Indicators (TAI). An indicator can be defined as a series of data points derived from asset price information by applying a mathematical formula [26]. These metrics were calculated using the same set of daily price time series used to create the networks. In relation to the networks, these domain features can be seen as characteristics that describe the nodes of a complex network. The domain features used are: Relative Strength Index (RSI), Simple Moving Average (SMA), Exponential Moving Average (EMA), Moving Average Convergence/Divergence (MACD), Average Directional Movement Index (ADX), Aroon Indicator (Aroon), Bollinger Bands (BB), Commodity Channel Index (CCI), Chande Momentum Oscillator (CMO), Rate of Change (ROC) and Average True Range (ATR). More information on how to calculate these features can be found in [27] and [26].
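To make the idea of an indicator concrete, the sketch below computes three of the simpler ones (SMA, EMA and ROC) from a list of closing prices. The formulas are common textbook forms; the periods, the seeding of the EMA with the first price, and the toy prices are assumptions for illustration, and the exact definitions in [26] and [27] may differ.

```python
def sma(prices, period):
    """Simple moving average of the last `period` prices."""
    window = prices[-period:]
    return sum(window) / len(window)

def ema(prices, period):
    """Exponential moving average with smoothing 2 / (period + 1).

    Assumption: the recursion is seeded with the first price.
    """
    alpha = 2.0 / (period + 1)
    value = prices[0]
    for price in prices[1:]:
        value = alpha * price + (1.0 - alpha) * value
    return value

def roc(prices, period):
    """Rate of change, in percent, between the last price and the price `period` days earlier."""
    past = prices[-1 - period]
    return 100.0 * (prices[-1] - past) / past

# Toy closing prices (illustrative values, not market data).
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
features = {"SMA_3": sma(closes, 3), "EMA_3": ema(closes, 3), "ROC_2": roc(closes, 2)}
```

Each indicator produces one value per asset per day, which can then be attached to the corresponding node.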

To create each example for the ML algorithm training data set, the subset of network derived features is concatenated to the subset of domain derived features. Thus, TAI related to assets i and j are inserted for both nodes.
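The concatenation described above can be sketched as follows. The helper name, the dictionary layout and all feature values are hypothetical; the actual features are the ones listed in Tables 1 and 2 and in the TAI list.

```python
def build_example(i, j, node_feats, link_feats, domain_feats, label=None):
    """Concatenate node features of i and j, link features of (i, j) and TAI of i and j."""
    features = (node_feats[i] + node_feats[j]
                + link_feats[(i, j)]
                + domain_feats[i] + domain_feats[j])
    return features, label

# Hypothetical feature values for two assets and the edge between them.
node_feats = {"AAA": [0.9, 3.0], "BBB": [0.4, 2.0]}          # node-level network metrics
link_feats = {("AAA", "BBB"): [0.35]}                         # link-level network metrics
domain_feats = {"AAA": [55.0, 10.2], "BBB": [48.0, 20.1]}     # technical indicators

# The label is the correlation of the pair in the future period (made-up value here).
x, y = build_example("AAA", "BBB", node_feats, link_feats, domain_feats, label=0.41)
```

With one such row per asset pair, each daily snapshot contributes one example per edge of the network.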

3.3 Logarithmic Return Forecast Using Machine Learning

Another issue to be considered in our MLink framework is the stock logarithmic return. The mathematical optimization model, presented in the following section, requires a measure of return and a measure of risk for each asset. Usually, the mean and the standard deviation of the returns are used in optimization algorithms to estimate future return and risk, respectively. Instead, we used an ML algorithm (MLP) to forecast the logarithmic return of all assets. For comparative analysis, we also used three statistical methods to estimate asset returns. These methods are:

  • MLP - Multilayer Perceptron neural network - has powerful approximation capabilities, and its self-adaptive, data-driven modelling approach gives it great flexibility in modelling time series data [12]. The MLP network used has one hidden layer with 5 neurons and is trained using the resilient backpropagation algorithm, a fast weight update mechanism for feedforward artificial neural networks [25].

  • ARIMA - Autoregressive Integrated Moving Average - models are fitted to the time series data to predict future points in the series [6].

  • Mean - Simple mean price return using time series data.

  • Median - Simple median price return using time series data.

Based on the result of each method, we then calculated the risk measure, which is given by the standard deviation of the logarithmic return time series. These return forecasts are used as input to the mathematical portfolio optimization model and also as input features for MLink to predict weighted links in the stock network. To create each example for the supervised learning data set for weighted link prediction in MLink, the forecasted return values of each pair of assets i and j are concatenated to the input set of features.
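The two simplest estimators and the risk measure can be sketched with the Python standard library. The function name and the toy returns are assumptions; in particular, the text does not say whether the sample or the population standard deviation is used, so the sample version is an assumption here. The MLP and ARIMA forecasters are not shown.

```python
import statistics

def forecast_and_risk(returns, method="mean"):
    """Return (forecast, risk) for one asset from its window of logarithmic returns."""
    if method == "mean":
        forecast = statistics.fmean(returns)
    elif method == "median":
        forecast = statistics.median(returns)
    else:
        raise ValueError(f"unknown method: {method}")
    # Assumption: sample standard deviation; statistics.pstdev gives the population version.
    risk = statistics.stdev(returns)
    return forecast, risk

# Toy logarithmic returns (illustrative values, not market data).
window = [0.01, -0.02, 0.03, 0.00, -0.01]
mean_forecast, mean_risk = forecast_and_risk(window, "mean")
median_forecast, median_risk = forecast_and_risk(window, "median")
```

In this sketch the risk depends only on the return window, so both methods share the same risk value and differ only in the forecast.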

3.4 Mathematical Model to Portfolio Optimization

In [16], Markowitz proposed the first mean-variance model, which served as the basis for Modern Portfolio Theory in financial management. In this theory, an investor wishes to distribute an initial wealth across a set of investments in order to minimize the risk and maximize the return. Naturally, these two objectives conflict: if a single investment offered both minimum risk and maximum return, the decision would be trivial. Usually, the higher the risk, the higher the expected return.

For this, [16] proposed a bi-objective quadratic programming model to find the Markowitz efficient front. In mathematical terms, consider n assets with return vector \(\mu \in \mathbb {R}^n\) and estimated covariance matrix \(\sigma \in \mathbb {R}^{n\times n}\), and let \(x \in \mathbb {R}^n\) be the fraction invested in each asset in the optimal portfolio. The Markowitz efficient front is computed by maximizing the expected return for a given level of risk (mean-variance model 1) or by minimizing the risk for a given level of expected return (mean-variance model 2).

$$\begin{aligned} \text {Maximize }\quad&\mathcal{E}= \sum _{i = 1}^n x_i\mu _i \nonumber \\ \text {Subject to: }\quad&\sum _{i = 1}^n \sum _{j = 1}^n x_ix_j\sigma _{ij} \le v^2, \\&\sum _{i = 1}^n x_i = 1, \nonumber \\&x_i \ge 0, \; \forall \; i = 1, ..., n. \nonumber \end{aligned}$$
(1)

In mean-variance model 1, the expected value of the portfolio (\(\mathcal{E}\)) is maximized subject to an upper bound on the variance (\(v^2\)); the portfolio fractions must sum to 1, and no investment can be negative.

$$\begin{aligned} \text {Minimize} \quad&v^2 = \sum _{i = 1}^n \sum _{j = 1}^n x_ix_j\sigma _{ij} \nonumber \\ \text {Subject to:}\quad&\sum _{i = 1}^n x_i\mu _i \ge \mathcal{E}, \\&\sum _{i = 1}^n x_i = 1, \nonumber \\&x_i \ge 0, \; \forall \; i = 1, ..., n. \nonumber \end{aligned}$$
(2)

In mean-variance model 2, the objective is to minimize the variance (\(v^2\)) subject to a minimum level of return (\(\mathcal{E}\)). Note that, by varying the variance bound (\(v^2\)) in mean-variance model 1 or the desired level of return (\(\mathcal{E}\)) in mean-variance model 2, we can build the Markowitz efficient front, i.e. the trade-off between risk and expected return.
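For intuition, mean-variance model 2 can be solved by brute force when there are only two assets, since every feasible portfolio has the form (a, 1 - a). The sketch below is not the solver used in the experiments; the function names, the step count and the toy values of the return vector and covariance matrix are assumptions. A quadratic programming solver would normally be used for realistic problem sizes.

```python
def portfolio_return(x, mu):
    """Expected portfolio return: sum_i x_i * mu_i."""
    return sum(xi * mi for xi, mi in zip(x, mu))

def portfolio_variance(x, sigma):
    """Portfolio variance: sum_i sum_j x_i * x_j * sigma_ij."""
    n = len(x)
    return sum(x[i] * x[j] * sigma[i][j] for i in range(n) for j in range(n))

def min_variance_for_target(mu, sigma, target, steps=100):
    """Two-asset model 2 by enumeration: minimize variance subject to return >= target.

    Weights (a, 1 - a) with a on a regular grid satisfy sum(x) = 1 and x >= 0.
    Returns (variance, weights), or None if no grid point reaches the target.
    """
    best = None
    for k in range(steps + 1):
        a = k / steps
        x = [a, 1.0 - a]
        if portfolio_return(x, mu) >= target - 1e-12:
            candidate = (portfolio_variance(x, sigma), x)
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best

# Toy inputs (illustrative values only).
mu = [0.10, 0.04]
sigma = [[0.04, 0.0],
         [0.0, 0.01]]
variance, weights = min_variance_for_target(mu, sigma, target=0.07)
```

Repeating the call for a range of target values gives one (variance, return) point per target, which together trace the efficient front for this toy case.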

[11] examines an alternative formulation for the problem using measures of absolute and relative risk aversion. Let u be a von Neumann-Morgenstern utility function. The absolute risk aversion is defined by \(R_a = -\frac{u''(w)}{u'(w)}\), where w is the valuation of the portfolio. The resulting formulation is presented below:

$$\begin{aligned} \text {Maximize}\quad&F = \sum _{i = 1}^n x_i\mu _i - \frac{R_a}{2}\sum _{i = 1}^n \sum _{j = 1}^n x_ix_j\sigma _{ij} \nonumber \\ \text {Subject to:}\quad&\sum _{i = 1}^n x_i = 1 \\&x_i \ge 0, \; \forall \; i = 1, ..., n. \nonumber \end{aligned}$$
(3)

In model 3, the objective is to maximize the expected return of the portfolio minus \(\frac{R_a}{2}\) times the variance of the portfolio. According to [11], utility functions of the negative exponential type generate very risk-averse portfolios. Thus, the efficient boundary can be obtained for values of \(R_a > 0\). The empirical results indicate that risky portfolios have \(R_a \le 2\), moderate-risk portfolios have \(2 \le R_a \le 4\), and risk-averse portfolios have \(R_a \ge 4\).
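The effect of the risk aversion parameter in model 3 can be illustrated with a coarse grid search over the feasible weights. This is a sketch under stated assumptions, not the optimization procedure used in the paper: the function names, the grid resolution and the toy inputs are all hypothetical, and a quadratic programming solver would be the usual choice in practice.

```python
from itertools import product

def variance(x, sigma):
    """Portfolio variance: sum_i sum_j x_i * x_j * sigma_ij."""
    n = len(x)
    return sum(x[i] * x[j] * sigma[i][j] for i in range(n) for j in range(n))

def utility(x, mu, sigma, risk_aversion):
    """Objective F of model 3: expected return minus (R_a / 2) times variance."""
    expected = sum(xi * mi for xi, mi in zip(x, mu))
    return expected - 0.5 * risk_aversion * variance(x, sigma)

def simplex_grid(n_assets, steps):
    """Yield all weight vectors with entries k / steps that are non-negative and sum to 1."""
    for ks in product(range(steps + 1), repeat=n_assets - 1):
        if sum(ks) <= steps:
            yield [k / steps for k in ks] + [(steps - sum(ks)) / steps]

def best_on_grid(mu, sigma, risk_aversion, steps=20):
    """Grid point with the largest objective value F."""
    return max(simplex_grid(len(mu), steps),
               key=lambda x: utility(x, mu, sigma, risk_aversion))

# Toy inputs (illustrative values only).
mu = [0.10, 0.04, 0.06]
sigma = [[0.04, 0.006, 0.0],
         [0.006, 0.01, 0.002],
         [0.0, 0.002, 0.0225]]

risk_neutral = best_on_grid(mu, sigma, risk_aversion=0.0)
threshold = best_on_grid(mu, sigma, risk_aversion=2.0)
risk_averse = best_on_grid(mu, sigma, risk_aversion=10.0)
```

With zero risk aversion the search puts all the weight on the asset with the highest expected return; as the parameter grows, the selected portfolio never has a larger variance than before.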

4 Experiments

In this section we present the results in three parts. First, we present results regarding return forecasting using different time series lengths L. Second, we present weighted link prediction results for the stock networks. We used the same time series length L to perform both the return forecast and the weighted link prediction experiments. Finally, we present financial results comparing MLink with Ibovespa, MVA and the three variants ARIMA-MLink, Median-MLink and Mean-MLink.

4.1 Machine Learning for Return Forecast

In this section we present a set of experimental results using different methods for asset return forecasting. We used different time series lengths \(L = \lbrace 10, 15, 20, 25, 30\rbrace \) as input to the time series forecast algorithms. Figure 2 compares the models using the Mean Absolute Error (MAE) evaluation metric [29]. For each length L, we plotted the Cumulative Distribution Function (CDF) of the MAE of each return forecast model. Thus, we can see the behavior of each model when different time series lengths are used.
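The evaluation described above can be sketched as follows. The text does not state over which units the CDF is taken, so treating it as the empirical CDF of a list of MAE values (for instance one per asset) is an assumption, as are the function names and toy numbers.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def empirical_cdf(values):
    """Pairs (value, k / n), where value is the k-th smallest of n observations."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]

def fraction_below(values, threshold):
    """Share of observations strictly below `threshold`."""
    return sum(v < threshold for v in values) / len(values)

# Toy example: one MAE value per asset (made-up numbers).
errors = [0.01, 0.04, 0.06, 0.02]
cdf_points = empirical_cdf(errors)
share = fraction_below(errors, 0.05)
```

Plotting the first element of each pair against the second gives a curve like those in the figures; a curve that rises faster indicates that more of the errors are small.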

Fig. 2. Return forecast results for each size of L

An important result presented in Fig. 2 is that Mean and Median have larger MAE values than the ML model (MLP). ARIMA has better results for small values of L, but presents the worst result for \(L = 30\). In contrast, MLP has good results for large values of L.

4.2 Machine Learning for Weighted Link Prediction

This section presents results related to weighted link prediction in dynamic stock networks. To perform these experiments, the training and test sets were built using a sliding window [26]. We used 30 daily graph snapshots in the training set, and the test set corresponds to the next trading day. The sliding window then moves one day ahead to create new training and test sets.
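The sliding-window protocol can be sketched as an index generator. The function name is hypothetical, and the snapshots are assumed to be stored in chronological order, one per trading day.

```python
def sliding_windows(n_snapshots, train_size=30):
    """Yield (train_indices, test_index) pairs.

    Each training set holds `train_size` consecutive daily snapshots and the
    test index is the next trading day; the window then moves one day ahead.
    """
    for start in range(n_snapshots - train_size):
        train = list(range(start, start + train_size))
        yield train, start + train_size

# With 33 snapshots there are three train/test splits.
splits = list(sliding_windows(33))
```

Because the test day always follows the training days, no information from the future leaks into the training set.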

We used XGBoost [7] as the main ML model for weighted link prediction. It is a fast, highly effective and widely used machine learning method. We did not perform an exhaustive search for model parameters because this is not our main objective. Our intention is to show how predictive a machine learning model can be using the set of features that we proposed. The model parameters are:

  • booster = “gbtree”;

  • objective = “reg:linear”;

  • eta = 0.05;

  • max_depth = 2;

  • min_child_weight = 100.
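For readers who want to reproduce this configuration, the listed parameters can be written as a Python dictionary of the kind usually passed to XGBoost's training function. Only the values stated above are included; anything not listed, such as the number of boosting rounds, is not given in the text and is left out here.

```python
# Parameters exactly as listed in the text; everything else stays at the library defaults.
xgb_params = {
    "booster": "gbtree",
    # "reg:linear" is the name used in the text; newer XGBoost releases
    # deprecate it in favor of "reg:squarederror" (same squared-error loss).
    "objective": "reg:linear",
    "eta": 0.05,
    "max_depth": 2,
    "min_child_weight": 100,
}
```

The small tree depth and the large minimum child weight both restrict model complexity, which fits the stated goal of showing the value of the features rather than tuning the model.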

Figure 3 presents comparative results using CDF for both MAE and Root Mean Square Error (RMSE) evaluation metrics.

Fig. 3. Weighted link prediction using different sizes of L

Figure 3 shows that the proposed method achieves good weighted link prediction results. Comparing the behavior of the machine learning model for the different values of L shows that, for larger values of L, the model predicts the edge weights (correlations) better. A possible explanation is that the larger L is, the more stable the financial network tends to be, which facilitates the prediction of edge weights; with smaller values of L, the network tends to be more unstable. Note that the value of L influences both return forecast and weighted link prediction. Considering that edge weights are between \(-1\) and 1, almost \(95\%\) of the MAE results for \(L = 30\) are under 0.05, which is a strong result for weighted link prediction. Each day, the model predicts 1540 edge weights (correlations).

4.3 Portfolio Optimization Experiments

This section presents a comparative experiment using \(L = 30\), which gave the best return forecast and weighted link prediction results. We executed the Markowitz model using the results of weighted link prediction and of each method applied to return forecasting. Figure 4 shows the simulated accumulated gross return for each approach. Each execution uses 84 trading days. We set the risk aversion to \(R_a = 2\), the threshold value between risky and moderate-risk portfolios.
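The text does not detail how the accumulated gross return is simulated. One plausible reading, sketched below purely as an assumption, is daily rebalancing with compounding of the realized portfolio return and no transaction costs. The function name and toy numbers are hypothetical, and simple (not logarithmic) daily returns are assumed.

```python
def accumulated_return(daily_weights, daily_asset_returns):
    """Compound the daily portfolio returns of a daily-rebalanced portfolio.

    Both arguments hold one list per trading day: the portfolio weights chosen
    for that day and the realized simple returns of the assets on that day.
    Transaction costs are ignored (gross return).
    """
    capital = 1.0
    for weights, returns in zip(daily_weights, daily_asset_returns):
        capital *= 1.0 + sum(w * r for w, r in zip(weights, returns))
    return capital - 1.0

# Two trading days, two assets (made-up numbers).
weights = [[0.5, 0.5], [1.0, 0.0]]
returns = [[0.10, 0.00], [-0.10, 0.20]]
total = accumulated_return(weights, returns)
```

In this toy case the capital grows by 5% on the first day and falls by 10% on the second, for an accumulated return of about -5.5%.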

Fig. 4. Comparing accumulated return for each approach

Figure 4 shows that our MLink method outperforms MVA by over \(56\%\) in accumulated gross return. ARIMA-MLink, Median-MLink and Mean-MLink also have better results than MVA and the Bovespa Index. This is an impressive result in terms of financial return.

5 Conclusions and Future Works

This study proposed determining the constants of the MVA from [16] using machine learning and a new weighted link prediction method for stock networks, defined in this paper as MLink. Portfolio optimization models use data from the past to create portfolios with a good trade-off between return and risk. Using the correlation between asset price series in stock networks, where vertices represent assets and edges represent the correlation between them, the proposed method can predict the weights of all edges in dynamic stock networks. Like the MVA, the proposed method also uses return forecasts. The experimental results show that, using both return forecast and weighted link prediction data, the performance of the proposed method is superior to that of the MVA. The experiments show that the MLink capital increases almost \(41.34\%\) in 84 days, a difference of \(56.68\%\) relative to the MVA (which reduced the capital by 9.79%).

These findings open a range of possibilities for future work. Several analyses will be carried out, such as using different optimization models and running experiments on known market data sets for portfolio optimization. In addition, we can use other ways of predicting edge weights, such as tensor or deep learning methods, to try to improve the weighted link prediction results. We can also make the created data set available so that other researchers can evaluate their optimization models.