Abstract
With the evolution of China's market economy, the securities market plays an increasingly pivotal role in the nation's economic landscape. Consequently, stock trend forecasting has garnered heightened attention among scholars and practitioners. This research pioneers the use of multimodal information to predict stock market fluctuations. In our experiments, LSTM + Transformer outperforms the baseline models on multimodal stock movement prediction in terms of accuracy, F1-score, precision, and recall. Additionally, we employed the Granger causality test and impulse response analysis to investigate the causal relationships between sentiment and stock trends, as well as the interplay between COVID-related indicators and stock trajectories. We identified discernible causal links between sentiments, COVID indicators, and stock trends for select pharmaceutical stocks. Our findings can provide valuable guidance for investors and market regulators, especially within the pharmaceutical industry. Understanding investor sentiment and the severity of the pandemic can assist in managing stock commentary effectively and improving investment strategies.
1 Introduction
As the global economy advances and the financial system evolves, the total market capitalization of the worldwide stock market continues to grow. Stock market prediction has been a focal point of scholarly research [9, 30, 45]. Accurate stock market predictions hold practical significance, as they offer the potential for higher investment returns [10, 42, 54]. Nevertheless, forecasting the stock market remains a formidable task due to the inherent randomness in stock movements and various influencing factors such as company performance, government policies, related media news, global pandemics, and more, making it an exceptionally challenging problem.
Stock price fluctuations are significantly affected by related information, broadly categorized into data information describing stocks, textual information about stocks, and other information that may impact the stock market. Understanding the relationships between stock-related information and stock price fluctuations can provide robust support for stock trend predictions [45, 49]. Media reporting can influence investor sentiment and affect stock prices [40]. The global COVID-19 pandemic, causing massive lockdowns [16, 17, 26, 55], has also disrupted financial markets and the international economy [58]. Asian financial markets faced challenging weekends, potentially causing destructive consequences for the worldwide economy.
The fluctuation of stock market prices reflects the fundamental condition of the industry market, which, in turn, affects individual investors' decisions and, ultimately, the entire country's economic situation. Accurately predicting stock price trends can assist investors in making informed investment decisions, reducing losses, and achieving higher returns. Moreover, it can enable governments to perform macroeconomic regulation based on industry stock price trends, promoting the healthy development of industries and ensuring national economic stability. Therefore, research on stock market trend prediction methods holds significant theoretical and practical value.
The pharmaceutical sector plays a critical role in global health, and various factors, including market sentiment and external events such as pandemics, influence its stock price trends. This research aims to utilize historical stock data, sentiment data, and COVID-related data to predict stock price trends and investigate their interrelationships, particularly in the pharmaceutical sector.
This research focuses on the pharmaceutical sector of the Chinese stock market, utilizing a deep learning-based approach that demonstrates advantages in handling large datasets and time series prediction. The study integrates stock price, news text, and COVID-19-related features to investigate stock market trends during crisis-driven demand situations. Unlike most research on steady-state markets, we analyze how the pharmaceutical industry responds to sudden, extreme pressures by selecting COVID-19 as the case. Multimodal information fusion in such research is often overlooked, with many studies fixating solely on stock price features and stock-related news, neglecting broader influencing factors. We explored the effectiveness and relationship between the stock price prediction and the COVID-related indicators. Furthermore, this article conducts a comparative analysis to validate the differences in stock trends, investor sentiment among stocks with varying popularity levels within the same industry, and the impact of COVID-19-related features. We also compare advanced classification algorithms, revealing the suitability of our proposed method for this multimodal classification task.
This research delves into several crucial research questions. Firstly, we question the efficacy of fused multimodal stock market data in predicting short-term stock price trends within the pharmaceutical industry. Secondly, we interrogate the extent of financial textual and covid-related data's impact on stock price trends, particularly within the pharmaceutical industry.
- RQ1: Can these diverse information sources be fused to provide a robust foundation for forecasting these trends?
- RQ2: How much do these information sources influence stock price movements?
2 Literature review
2.1 Methodologies in stock trend prediction
Stock market trend prediction is a classic research problem that has garnered extensive attention from financial experts and computer scientists. In the past, researchers sought effective stock prediction models by employing traditional time series algorithms and conventional machine learning techniques to forecast stock trends. However, standard time series algorithms, which assume the data is linear [34], often lack the capacity to handle complex and nonlinear stock data. In contrast, conventional machine learning algorithms exhibit strong nonlinear mapping capabilities that compensate for the shortcomings of traditional time series approaches and demonstrate notable performance in stock trend prediction. Nonetheless, they still struggle to capture the temporal dependencies in stock prices. In recent years, the rapid development of deep learning technology has led to groundbreaking research outcomes across various domains. Deep learning methods have been widely adopted in stock market prediction and have shown superior predictive performance compared to traditional machine learning techniques.
Previous research on stock market trend prediction has centered on individual stock forecasts, capturing stock movement patterns from independent historical information to predict future trends accurately. Over time, the field of stock market prediction has transitioned from using traditional statistical methods to embracing neural networks and artificial intelligence techniques. The ARIMA (AutoRegressive Integrated Moving Average) model [46] is a commonly used time series statistical model widely employed in early stock market forecasting. Gao and Feng [8] introduce an enhanced ARIMA method for stock price forecasting that combines B-spline basis expansion and model averaging. It utilizes intraday prices, improving ARIMA's accuracy and robustness across industries.
However, traditional statistical methods have proven ineffective in achieving accurate predictions when dealing with large and volatile stock datasets. Shankar et al. [35] treated stock market prediction as a classification problem, aiming to forecast the future trend of stock prices rather than specific price values. They employed a random forest approach for trend prediction and found that it outperformed other algorithms. Long Short-Term Memory (LSTM) networks are classical time series prediction algorithms, and Moghar and Hamiche [27] demonstrated that LSTM's predictive performance is superior to traditional machine learning algorithms. Ho et al. [12] compare the ARIMA, Neural Network, and LSTM models for predicting Bursa Malaysia's stock prices during the COVID-19 pandemic, with the LSTM achieving the highest accuracy. Song et al. [38] incorporate technical-analysis input features with a high profitability rate to achieve improved performance. Yang and Wang [52] employed Bidirectional LSTM (BiLSTM) to predict the closing prices of the Shanghai and Shenzhen 300 Index, affirming the practical value of deep neural networks in financial forecasting. Ding et al. [6] introduced a Hierarchical Multi-Scale Gaussian Transformer (HMG-TF) model for short-term stock trend prediction. Lee et al. [19] introduce a method to enhance stock price predictions by fusing stock data, macroeconomic indicators, and temporal factors using a multimodal fusion transformer. Employing an early fusion strategy, this approach outperforms traditional models, offering a comprehensive solution to stock direction prediction through integrated data sources and advanced neural techniques.
2.2 Impact of investor sentiment on the stock market
Thakkar and Chaudhari [42] state that recent finance-related research has amalgamated diverse modal data with stock price information. The rapid development of deep learning technology in natural language processing has laid the foundation for mining textual information. A substantial body of research indicates a connection between investor sentiment and stock market returns. Textual descriptions of stocks can influence investor emotions. More accurate stock price predictions can be achieved by mining textual data describing stocks, extracting text features related to stock price fluctuations, and integrating them with stock price features as inputs into prediction models.
With ubiquitous access to social media, public sentiment spreads fast and wide, especially during the massive lockdowns of the COVID-19 pandemic [39]. Mudinas et al. [28] examine sentiment analysis in financial forecasting and find inconsistent patterns between sentiment and stock prices, although integrating sentiment can still enhance prediction models. Li et al. [20] built a stock prediction system using technical stock price indicators and sentiment vectors from news articles. The proposed approach, which incorporates sequential information within market snapshot series through a layered deep learning model, outperforms baselines in prediction accuracy. Nti et al. [31] attempted to predict Ghana's stock market trends using sentiment analysis with an MLP architecture, considering different time windows for predicting future stock prices, and found that investors can rely on information published in financial news, Twitter, and forums to execute investment strategies. Ho and Huang [13] use a multimodal collaborative network that combines candlestick-chart and social-media data for stock trend predictions, achieving higher accuracy than single-network models. Lazzini et al. [18] reveal a significant Granger causality between tweets and the FTSE MIB closing prices during COVID-19, highlighting the strong correlation between social media sentiment and stock market volatility. Hu et al. [15] find that shocks to investor sentiment significantly impact stock market returns, with larger effects in bullish markets than in bearish ones, underscoring the profitability linked to investor sentiment.
2.3 Impact of COVID-19 on the stock market
In the annals of contemporary economic history, few events rival the profound disruption caused by the COVID-19 pandemic. This disruption was acknowledged universally on March 11, 2020, when the World Health Organization (WHO) characterized COVID-19 as a global pandemic [48]. Such a designation did not merely signify the health implications but alluded to vast consequences for the global economic system. After this declaration, markets worldwide experienced turmoil. Risk-averse investors, often lauded for their cautionary approach, faced rapid evaporation of asset value attributable to the virulence and expeditious propagation of the virus. Diverse financial markets, whether stock, commodity, or debt, were ensnared in this vortex of uncertainty [7].
The academic evolution in comprehending the fiscal repercussions of COVID-19 has been, in itself, dynamic. Initially, the implications were understood simplistically, but the intricacies emerged with the ongoing accumulation of empirical data and rigorous analyses. The term "COVID-19 shock" entered the lexicon of financial literature, not merely as a transient phrase but as a representative of a seismic shift with long-term connotations [5]. Econometric studies have sought to quantify the relationship between epidemiological parameters (such as infection and mortality rates) and their consequences on stock market performance. Sharma et al. [36] and Zhang et al. [56] unequivocally elucidated that regions with heightened health distress reflected consequential downturns in stock market indices. However, anomalies persist. For instance, the Chinese stock market, though initially perturbed, exhibited resilience, a phenomenon attributed to timely and efficacious policy measures [14]. Recognizing the importance of sentiment in economic frameworks, scholars endeavored to gauge collective psychology during these trying times. Tools such as the global fear index for COVID-19 were devised, serving not merely as academic curiosities but as pragmatic indicators guiding investment strategies [11, 24].
One of the dire consequences of such a global economic perturbation is the specter of financial risk contagion. The virality of financial downturns, in many ways mirroring the spread of the contagion itself, propagated disruptions across the financial architecture. These disruptions manifested as negative yields, amplified uncertainty, and escalated market volatility [21, 51]. In the realm of emerging financial assets, the pandemic's influence was discernible as well. Ampountolas [2] employed an intricate two-stage multivariate volatility Exponential Generalized Autoregressive Conditional Heteroskedasticity (EGARCH) model, incorporating a dynamic conditional correlation (DCC) approach, to decipher the interplay between COVID-19 and the volatility within the cryptocurrency and stock domains. Their findings underscored substantial spillover effects and accentuated the heterogeneity of asset risk profiles during this period.
Traditional statistical methods often fall short in capturing the complex and nonlinear relationships in stock market data. Most existing studies have yet to fully exploit the potential of combining multiple data sources, especially in the post-COVID era. While there is growing interest in understanding the impact of investor sentiment and COVID-19 on the stock market, comprehensive studies that integrate COVID-19-related data with financial and sentiment data for prediction and impact analysis are still lacking.
These gaps, identified in prior studies on stock market prediction, particularly regarding investor sentiment and the impact of COVID-19, motivate our research. We adopt advanced deep learning techniques, such as LSTM networks and Transformer models, to address the limitations of traditional statistical methods and machine learning algorithms. Through multimodal information fusion, we combine stock time series, textual information, and COVID-19-related data to capture the intricate relationships influencing stock price fluctuations. Our comprehensive approach aims to provide valuable insights for investors, financial analysts, and policymakers in making informed decisions, optimizing investment strategies, and managing risks amid the dynamic stock market landscape.
3 Data and methodologies
3.1 Data collection and preparation
This research selected eight Chinese pharmaceutical stocks with representative differences in popularity as research objects and obtained their investor evaluation and stock information data through crawler technology, as shown in Table 1. The stock time-series data was fetched from the Tushare website. The stock news from Eastmoney, Yuncaijing, and 10jqka was chosen as the target information source. COVID-related information was downloaded from Our World In Data [25] and contains infectious cases worldwide. This research covered the interval from January 1, 2020, to June 9, 2023. We confirmed no missing values for the chosen period in any datasets to ensure the integrity and completeness of the data used in our analysis, allowing us to perform robust and reliable assessments without additional data imputation or missing value-handling techniques.
Figure 1 depicts the closing price of each stock. The stock PZH shows the highest closing prices among all the stocks in the given period. It also shows a general upward trend. The stocks CCGX, KLY, and PZH all show significant volatility, with prices fluctuating widely over the given period. The remaining stocks, including CHKJ, FXYY, HLSW, JAYL, and YLYY, have lower closing prices and show less volatility than CCGX, KLY, and PZH. Most stocks show a general upward trend, indicating that their prices have increased. However, there are also periods of downturn in which prices decrease.
We encoded the sentiments of financial news text using AutoTokenizer and AutoModelForSequenceClassification from HuggingFace. Specifically, we employed a pre-trained sentiment classification model (Footnote 1). The sentiment scores range from zero to four, where a score closer to zero indicates more negative sentiment, and a score closer to four indicates more positive sentiment. To convert the text into sentiment scores, we followed three steps. First, we used AutoTokenizer to tokenize the financial news texts, which converts the raw text into a suitable format for the model. Then, we used AutoModelForSequenceClassification to process the tokenized text. The model outputs a set of logits representing the likelihood of each sentiment category. Finally, the logits are processed to obtain the final sentiment score by selecting the index of the maximum logit value, which corresponds to the predicted sentiment category (zero to four). Table 2 describes the general information of each stock's sentiment. All the selected stocks' sentiments tend to be positive, with a reasonable range of variation from zero to four.
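The last of the three steps, turning logits into a score from zero to four, can be sketched in a few lines. This is a minimal illustration with NumPy only; the logit values and the function name `logits_to_sentiment` are made up for the example, and in a Hugging Face workflow the array would typically come from the `logits` attribute of the model output.

```python
import numpy as np

def logits_to_sentiment(logits) -> np.ndarray:
    """Map an (n_texts, 5) array of class logits to integer scores 0..4."""
    logits = np.asarray(logits, dtype=float)
    # The predicted category is the index of the largest logit in each row.
    return logits.argmax(axis=-1)

# Hypothetical logits for three news items (values invented for illustration).
example_logits = np.array([
    [2.1, 0.3, -0.5, -1.0, -1.2],   # largest at index 0: most negative
    [-0.8, -0.2, 0.1, 1.5, 0.9],    # largest at index 3: positive
    [-1.1, -0.9, 0.0, 0.4, 2.3],    # largest at index 4: most positive
])
scores = logits_to_sentiment(example_logits)
```

Because only the argmax is kept, the score is a hard category label and carries no information about how confident the classifier was.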
Figure 2 showcases the new cases and new deaths in China. At the chart's beginning, we observe a significant spike in new cases, corresponding to the pandemic's early days when the virus was identified in Wuhan, China. After this spike, there is a noticeable decline in new cases, suggesting that measures China took in the early stages effectively controlled the spread. The red line, which represents new deaths, remains relatively low throughout the period, indicating that the death rate was controlled while new cases were identified. After the initial wave, the number of new cases in China has remained relatively low, with minor spikes, indicating that China has effectively managed outbreaks and prevented large-scale resurgences.
3.2 Stock trend prediction model
We use the LSTM [37] and Transformer [44] to make the prediction. These are proven deep learning methods known for their efficiency in sequence prediction tasks. It is also necessary to compare with some other best-performing models. Support Vector Machine (SVM), Random Forest, and Naïve Bayes are the selected models in this study.
3.2.1 LSTM
LSTM networks signify advanced Recurrent Neural Networks (RNNs) [53]. Introduced to overcome certain limitations of the traditional RNN, this new architecture incorporated an additional component, the 'forget gate,' to further enhance its functionality. The advent of the improved LSTM successfully addressed the pervasive issue of 'vanishing gradients' that had previously hampered the process of model training within RNNs. This feature has allowed LSTM to learn and remember dependencies of various time spans within a sequence, marking it as one of the most successful adaptations of RNNs to date. Its applications span a diverse range of fields due to its versatile structure.
3.2.2 Transformer
The Transformer is an influential deep-learning model built upon the attention mechanism [22]. It consists of two integral parts, the encoder and the decoder, each containing six identical network layers. Each encoder layer integrates a Multi-Head Attention mechanism and a Positionwise FeedForward Network (FFN). The Multi-Head Attention component calculates the interrelationships among the input features, creating the encoded vectors [10, 41]. The FFN, deploying a fully-connected layer, then proceeds to refine these encoded vectors [3]. In contrast, each decoder layer comprises three sublayers: Masked Multi-Head Attention, Multi-Head Attention, and FFN. Masked Multi-Head Attention ensures that the output at a specific time instance is solely related to the output from preceding instances.
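The masking idea can be made concrete with a small NumPy sketch of scaled dot-product attention for a single head. It is a simplification under stated assumptions: there are no learned projection matrices and no multiple heads, the sizes are arbitrary, and the function name `causal_attention` is illustrative rather than taken from any library.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a look-back-only mask."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Entries above the diagonal (j > i) are masked so that step i
    # cannot attend to any later step j.
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax; subtracting the row maximum keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 time steps, 8 features (arbitrary)
out, w = causal_attention(x, x, x)
```

The attention weights `w` form a lower-triangular matrix whose rows sum to one, so the first time step can only attend to itself.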
3.2.3 SVM
SVM is a learning machine rooted in statistical learning theory [33]. It approximates the method of structural risk minimization and is particularly adept at handling small-sample data. SVM is lauded for its rapid computation and robust generalization capabilities, finding widespread application across numerous domains. Identifying optimal hyperplane parameters is crucial to enhancing prediction accuracy [57].
3.2.4 Random forest
Random Forests, a celebrated ensemble learning technique, have gained extensive traction in data classification and non-parametric regression [47]. Constructed on the bedrock of decision trees, the Random Forest method embodies the principles of bagging [23]. Further, it incorporates a dimension of randomness by selecting attributes during the decision tree's training phase. By employing mechanisms of voting or averaging, Random Forest amalgamates the results from each tree to deduce a consolidated prediction.
3.2.5 Naïve Bayes
The Naive Bayes classification approach, revered as a cornerstone in the realm of probabilistic algorithms, stands predicated upon two foundational pillars: the celebrated Bayesian theorem and the assumption that features operate under conditions of independence [1]. Drawing inspiration from the Bayesian probability structure, the algorithm diligently calculates the posterior probability based on the distribution probabilities inherent to input–output relationships. The objective becomes the category identification and subsequent extraction with the peak posterior probability value.
3.3 Construction of indicators
The stock market, particularly in sectors as dynamic and ever-evolving as the pharmaceutical industry, operates not merely on complex data but also on sentiment, external events, and subtle market dynamics. Predicting stock price trends requires amalgamating these diverse data streams into a coherent set of indicators that can capture the nuanced interplay of these factors. Our indicators are inspired by Baker and Wurgler [4], Okunev and White [32], and Thomsett [43]. This part describes the indicators explicitly constructed for this task. Each indicator, succinctly coded for efficient computational analysis, reflects a particular facet of market behavior. From understanding simple price shifts to gauging the impact of external events like the COVID-19 pandemic, these indicators offer a comprehensive lens through which the stock market's pulse can be felt and predicted. For each indicator, we present its mathematical formulation, the rationale behind its construction, and its intended purpose in the larger prediction model.
3.3.1 Stock market trend indicators
1) Volume-Adjusted 6-Day Return (VA6DR)
VA6DR represents how much the stock has returned over six days, factoring in the volume of stocks traded. Combined with the volume, this feature gives a clearer picture of the stock's momentum. It takes the difference between the current close price and the close price from six days ago, normalized by the close price from six days ago. This ratio is then multiplied by the volume to get a volume-adjusted return. Let \(Close_{t}\) be the close price at time t, \(Close_{t-6}\) be the close price from six days ago, and \(Volume_{t}\) be the stock's volume at time t. The formula is as follows:
$$VA6DR=\frac{{Close}_{t}-{Close}_{t-6}}{{Close}_{t-6}}\times {Volume}_{t}$$
2) Upward Movement Frequency (UMF)
UMF measures the bullish sentiment over the past 20 days. It counts the number of times in the last 20 days that the stock's closing price, denoted as \(Close_{t-i}\), was higher than the previous day's, denoted as \(Close_{t-i-1}\). This count is then expressed as a percentage, highlighting the bullish consistency over a short period. In the formula, the bracketed comparison equals one when it holds and zero otherwise:
$$UMF=\frac{{\sum }_{i=1}^{20}\left({Close}_{t-i}>{Close}_{t-i-1}\right)}{20}\times 100$$
3) Price Range Momentum (PRM)
PRM compares the aggregated bullish price movements (high-open) against the aggregated bearish price movements (open-low) over 50 days. The ratio provides insight into the bullish vs. bearish sentiment over an extended period, which can be essential for trend analysis. Let \(High_{t-i}\) be the highest price, \(Low_{t-i}\) be the lowest price, and \(Open_{t-i}\) be the open price at time t-i. The formula is given by:
$$PRM=\frac{{\sum }_{i=1}^{50}\left({High}_{t-i}-{Open}_{t-i}\right)}{{\sum }_{i=1}^{50}\left({Open}_{t-i}-{Low}_{t-i}\right)}\times 100$$
4) Power-Weighted High-Low Differential to VWAP (PWHLD)
PWHLD provides a perspective on the stock's value by comparing its high and low prices with its Volume Weighted Average Price (VWAP). It is the difference between a power-weighted combination of the high and low prices and the VWAP, and it gives an idea of the stock's valuation relative to its average price weighted by volume. \(\alpha\) is a constant exponent hyperparameter, \(Amount_{t}\) is the total trading amount of the stock at time t, and \(Volume_{t}\) is the total trading volume of the stock at time t. The formula is:
$$PWHLD={\left({High}_{t}\times {Low}_{t}\right)}^{\alpha }-\frac{{Amount}_{t}}{{Volume}_{t}\times 10}$$
5) 20-Day Amount Moving Average (20DAMA)
The 20DAMA is the moving average of the stock's trading amount over the past 20 days, offering a smoothed representation of that amount. It provides a lagging indicator of the stock's average trading amount, which can help determine liquidity trends. The formula is presented as follows:
$$20DAMA=\frac{1}{20}\sum_{i=1}^{20}{Amount}_{t-i}$$
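The five trend indicators above translate fairly directly into rolling-window operations. The following pandas sketch is one possible reading of the formulas, not the authors' implementation: the column names (`open`, `high`, `low`, `close`, `volume`, `amount`), the function name `trend_indicators`, and the default `alpha=0.5` are all assumptions made for illustration, since no value of \(\alpha\) is stated. Sums that run over i = 1 to n are taken to exclude the current day, which is why the series are shifted by one before the rolling window is applied.

```python
import numpy as np
import pandas as pd

def trend_indicators(df: pd.DataFrame, alpha: float = 0.5) -> pd.DataFrame:
    """Compute VA6DR, UMF, PRM, PWHLD and 20DAMA from daily bars.

    Assumed columns: open, high, low, close, volume, amount.
    alpha=0.5 is an illustrative default, not a value taken from the text.
    """
    op, hi, lo, cl = df["open"], df["high"], df["low"], df["close"]
    out = pd.DataFrame(index=df.index)

    # VA6DR: six-day return scaled by the current day's volume.
    out["VA6DR"] = (cl - cl.shift(6)) / cl.shift(6) * df["volume"]

    # UMF: share of up-closes among days t-1 .. t-20, in percent.
    diff = cl.diff()
    up = (diff > 0).astype(float).where(diff.notna())
    out["UMF"] = up.shift(1).rolling(20).sum() / 20 * 100

    # PRM: summed (high - open) over summed (open - low), days t-1 .. t-50.
    bull = (hi - op).shift(1).rolling(50).sum()
    bear = (op - lo).shift(1).rolling(50).sum()
    out["PRM"] = bull / bear * 100

    # PWHLD: power-weighted high-low product minus the amount/volume term.
    out["PWHLD"] = (hi * lo) ** alpha - df["amount"] / (df["volume"] * 10)

    # 20DAMA: mean trading amount over days t-1 .. t-20.
    out["20DAMA"] = df["amount"].shift(1).rolling(20).mean()
    return out

# Synthetic bars: the close rises by 1 each day, so every day is an up day.
close = 10.0 + np.arange(60)
bars = pd.DataFrame({
    "open": close - 0.5, "high": close + 1.0, "low": close - 1.0,
    "close": close, "volume": 1000.0, "amount": 2000.0,
})
indicators = trend_indicators(bars)
last = indicators.iloc[-1]
```

Rows that lack enough history are left as NaN rather than filled, so the first 50 rows of `PRM` are undefined in this sketch.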
3.3.2 Intraday trading indicators
1) Day Pattern Analysis (DPA)
DPA identifies typical bearish day patterns. This feature returns a value of −1 if the condition (high-open) times (close-low) is less than (open-low) times (high-close); otherwise, it returns 0. This indicates the shape of the candlestick for the day and suggests potential bearish patterns. The formula is as follows:
$$DPA=\left\{\begin{array}{cc}-1& if \left({High}_{t}-{Open}_{t}\right)\left({Close}_{t}-{Low}_{t}\right)<\left({Open}_{t}-{Low}_{t}\right)\left({High}_{t}-{Close}_{t}\right)\\ 0& otherwise\end{array}\right.$$
2) Overnight Return (OR)
OR indicates the stock's price change from the previous day's closing value to its current day's opening value, i.e., the ratio of the difference between the opening price and the previous day's closing price to the previous day's closing price. This feature can be instrumental in understanding gaps or potential market reactions to after-hours news.
$$OR=\frac{{Open}_{t}}{{Close}_{t-1}}-1$$
3) Short-Term Closing Price Comparison (STCPC)
STCPC assesses whether the current stock's price is above or below its recent average, i.e., the ratio of the 6-day moving average closing price to the current price. It is a valuable metric to quickly gauge if the stock is overbought or oversold in the short term.
$$STCPC=\frac{\frac{1}{6}{\sum }_{i=1}^{6}{Close}_{t-i}}{{Close}_{t}}$$
4) Intraday Price Momentum (IPM)
IPM offers a perspective on the stock's intraday price movement, i.e., the ratio of the difference between the closing and opening prices to the day's range. This provides insights into intraday trading sentiment and strength. \(\gamma\) is a small constant added to the denominator to avoid division by zero and to stabilize the calculation. The formula is:
$$IPM=\frac{{Close}_{t}-{Open}_{t}}{{High}_{t}-{Low}_{t}+\gamma }$$
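The four intraday indicators can be sketched in the same style. As before, the column names, the function name `intraday_indicators`, and the default `gamma=1e-8` are assumptions made for illustration (no value of \(\gamma\) is stated), and the price bars are invented.

```python
import numpy as np
import pandas as pd

def intraday_indicators(df: pd.DataFrame, gamma: float = 1e-8) -> pd.DataFrame:
    """Compute DPA, OR, STCPC and IPM from daily bars.

    Assumed columns: open, high, low, close. gamma is an illustrative default.
    """
    op, hi, lo, cl = df["open"], df["high"], df["low"], df["close"]
    out = pd.DataFrame(index=df.index)

    # DPA: -1 when the candlestick satisfies the bearish condition, else 0.
    bearish = (hi - op) * (cl - lo) < (op - lo) * (hi - cl)
    out["DPA"] = np.where(bearish, -1, 0)

    # OR: gap between today's open and yesterday's close.
    out["OR"] = op / cl.shift(1) - 1

    # STCPC: mean close over days t-1 .. t-6 divided by today's close.
    out["STCPC"] = cl.shift(1).rolling(6).mean() / cl

    # IPM: close-minus-open relative to the day's range, stabilized by gamma.
    out["IPM"] = (cl - op) / (hi - lo + gamma)
    return out

# Hypothetical bars: seven rising days followed by a sharp reversal.
bars = pd.DataFrame({
    "open":  [10.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.5],
    "high":  [10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.0],
    "low":   [9.5, 9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 13.5],
    "close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 14.0],
})
intraday = intraday_indicators(bars)
```

On the final day the stock opens near its high and closes near its low, so `DPA` flags it as bearish and `IPM` is strongly negative.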
3.3.3 Sentiment and COVID-19 indicators
1) Sentiment Differential Index (SDI)
The SDI is an investor sentiment indicator based on investor intelligence data, reflecting the difference between bullish and bearish proportions, aiding in market outlook assessment. For sentiment analysis, our study employed a pre-trained model loaded through the AutoTokenizer and AutoModelForSequenceClassification classes from Hugging Face (Footnote 2). This pre-trained model, adept in processing Chinese text, is based on advanced neural network architectures. It was selected for its proven accuracy in semantic classification, a crucial aspect of interpreting financial news sentiment. Since multiple news items can appear within a single day, we aggregate the corresponding sentiment scores by calculating the daily mean for both bullish and bearish sentiments. Let \(M_{Bull}\) represent the average of bullish sentiment scores, indicating those classified as three or four on our sentiment scale, and let \(M_{Bearish}\) represent the average of bearish sentiment scores, indicating those classified as zero or one on our sentiment scale. The formulas are given by:
$${M}_{Bull}=\frac{1}{{N}_{Bull}}\sum_{i=1}^{{N}_{Bull}}{Sentiment}_{Bull,i}$$
$${M}_{Bearish}=\frac{1}{{N}_{Bearish}}\sum_{i=1}^{{N}_{Bearish}}{Sentiment}_{Bearish,i}$$
where \(N_{Bull}\) is the number of news articles classified as bullish on a given day, and \(Sentiment_{Bull,i}\) is the sentiment score of the i-th bullish news article. Similarly, \(N_{Bearish}\) is the number of news articles classified as bearish on a given day, and \(Sentiment_{Bearish,i}\) is the sentiment score of the i-th bearish news article.
The SDI is then calculated using the following formula:
$$SDI=\frac{{M}_{Bull}-{M}_{Bearish}}{{M}_{Bull}+{M}_{Bearish}}$$
This approach ensures that the sentiment scores are normalized and comparable across different days, providing a balanced measure of the overall market sentiment.
2) Indirect sentiment indicators
The turnover rate and stock amplitude were chosen as indirect sentiment indicators. Stock amplitude measures the variability in stock activity, reflecting the level of uncertainty or excitement in the market. A higher stock amplitude indicates greater variability in the turnover rate, which can be interpreted as higher market sentiment or activity. Previously, few scholars have used stock amplitude as a sentiment indicator variable. STD represents the standard deviation, and \(Turnover_{t,t-window}\) represents the turnover rate over a rolling window of seven days. The formula is as follows:
$$Stock\ Amplitude=STD\left({Turnover}_{t,t-window}\right)$$
3)
COVID Severity
The COVID Severity indicator aims to provide a normalized measure of the daily severity of COVID-19 cases. The volatility of daily cases and the differences in magnitude between years (e.g., 2020 and 2023) highlight the need for a normalized metric. This index gives context to the raw daily numbers, allowing a more comparative analysis of how the daily reported cases fare against a recent historical average. The formula captures the relative severity of daily COVID cases by comparing the daily cases with the 20-day moving average of cases. Dividing the daily cases by this average and subtracting one gives the relative deviation from the recent average, signifying whether the current day's cases are above or below the recent trend. Let \({N}_{cases}\) represent the number of COVID-19 cases reported on a specific day, and \({M}_{cases,20}\) represent the 20-day moving average of the number of COVID-19 cases. The formula is:
$$Severity=\frac{{N}_{cases}}{{M}_{cases,20}}-1$$
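The three indicators defined above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the treatment of days with no articles, the use of the sample standard deviation, and the choice of a trailing window that includes the current day are all assumptions made for illustration.

```python
from statistics import mean, stdev


def sdi(bull_scores, bear_scores):
    """SDI = (M_Bull - M_Bearish) / (M_Bull + M_Bearish) for one day's articles."""
    # Assumption: a class with no articles on that day contributes a mean of 0.
    m_bull = mean(bull_scores) if bull_scores else 0.0
    m_bear = mean(bear_scores) if bear_scores else 0.0
    denom = m_bull + m_bear
    # The ratio is undefined when the two means sum to zero.
    return None if denom == 0 else (m_bull - m_bear) / denom


def stock_amplitude(turnover, window=7):
    """Rolling standard deviation of the turnover rate (None until the window fills)."""
    out = []
    for t in range(len(turnover)):
        if t + 1 < window:
            out.append(None)
        else:
            # Assumption: sample standard deviation over a trailing window ending at day t.
            out.append(stdev(turnover[t + 1 - window:t + 1]))
    return out


def covid_severity(daily_cases, window=20):
    """Severity = N_cases / (moving average of cases) - 1 (None until the window fills)."""
    out = []
    for t in range(len(daily_cases)):
        if t + 1 < window:
            out.append(None)
            continue
        # Assumption: the moving average is trailing and includes the current day.
        avg = mean(daily_cases[t + 1 - window:t + 1])
        out.append(None if avg == 0 else daily_cases[t] / avg - 1)
    return out
```

A positive severity value means the day's cases exceed the recent average, and a negative value means they fall below it.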
3.4 Evaluation metrics
This experiment predicts short-term stock trends, and the output is "bullish" or "bearish." Accuracy, precision, recall, and F1 score are selected as evaluation indicators. Accuracy is the proportion of all samples that the framework classifies correctly. Precision is the proportion of samples labeled "bullish" by the model that are actually "bullish"; a higher precision means that a larger share of the model's bullish predictions are correct. Recall is the proportion of actually "bullish" samples that the model labels "bullish"; a higher recall means that the model captures more of the actual positive instances. The F1 score is the harmonic mean of precision and recall, so a higher F1 score indicates that the model performs well on both. Their formulations can be defined as follows (Table 3):
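As an illustration of these four metrics, the following sketch computes them from true and predicted labels, treating "bullish" as the positive class. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for this example, not details taken from the study.

```python
def classification_metrics(y_true, y_pred, positive="bullish"):
    """Accuracy, precision, recall, and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Assumption: report 0.0 when a ratio is undefined (no positive predictions or labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
truth = ["bullish", "bullish", "bullish", "bearish", "bearish"]
preds = ["bullish", "bullish", "bearish", "bullish", "bearish"]
scores = classification_metrics(truth, preds)
```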
3.5 Granger causality test
Applying the Granger causality test requires the data to be a stationary time series. Consequently, the Augmented Dickey-Fuller (ADF) test is used to verify the stationarity of the data before the Granger causality test.
The Granger causality test examines whether past values of one time series help predict another. Here the two variables, stock trend (y) and sentiment (x), are each represented as a time series. The following equations are estimated:
where \({y}_{t}\) and \({x}_{t}\) are the current-period values of the two time series, a, b, c, and d are the coefficients of the lagged values, p is the maximum number of lags included in the model, and \({e}_{1t}\) and \({e}_{2t}\) are the error terms at time t.
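The estimated equations are not reproduced in this text. A standard bivariate form that is consistent with the symbols defined above would be the following; the assignment of a, b, c, and d to particular terms and the intercepts \(a_0\) and \(c_0\) are assumptions, not taken from the source.

```latex
y_t = a_0 + \sum_{i=1}^{p} a_i \, y_{t-i} + \sum_{i=1}^{p} b_i \, x_{t-i} + e_{1t}
x_t = c_0 + \sum_{i=1}^{p} c_i \, x_{t-i} + \sum_{i=1}^{p} d_i \, y_{t-i} + e_{2t}
```

Under this form, the null hypothesis that x does not Granger-cause y corresponds to \(b_1 = \dots = b_p = 0\), and the null hypothesis that y does not Granger-cause x corresponds to \(d_1 = \dots = d_p = 0\).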
In time series causal analysis, impulse response analysis serves as a tool for discerning causal relationships between time series. Notably, it allows a quantitative assessment of the magnitude and direction of such relationships.
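To make the idea of an impulse response concrete, the sketch below traces how a one-unit shock propagates through a first-order vector autoregression \(z_t = A z_{t-1} + e_t\), where the response at horizon h is the matrix power \(A^h\). This is a simplified illustration: the lag order of one, the non-orthogonalized unit shocks, and the coefficient values are assumptions and do not come from the study.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def impulse_responses(coef, horizons):
    """Return [A^0, A^1, ..., A^horizons] for the VAR(1) z_t = A z_{t-1} + e_t.

    Entry [i][j] of the h-th matrix is the response of variable i at horizon h
    to a one-unit shock in variable j at horizon 0.
    """
    n = len(coef)
    current = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    out = [current]
    for _ in range(horizons):
        current = matmul(coef, current)
        out.append(current)
    return out


# Hypothetical coefficients; variable order is (stock trend, sentiment).
A = [[0.5, 0.2],
     [0.1, 0.3]]
irf = impulse_responses(A, horizons=8)
# Response of stock trend (row 0) to a sentiment shock (column 1) at each horizon.
trend_to_sentiment = [step[0][1] for step in irf]
```

When all eigenvalues of the coefficient matrix lie inside the unit circle, as in this example, the responses decay toward zero as the horizon grows.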
4 Results and analysis
4.1 Summary of collected data
Through the Tushare data API, we gathered data spanning 845 trading days from January 1, 2020, to June 9, 2023. The collected data cover a range of variables, including open price, high price, low price, close price, volume, and turnover rate. Additionally, 2,400,329 financial news articles were collected from Eastmoney, Yuncaijing, and 10jqka. For COVID-19, we collected data for China covering the same days as the stock data. China has reported approximately 99.3 million confirmed cases of COVID-19.
Figure 3 shows the correlation between different stocks. The stock pairs CCGX and KLY, CCGX and PZH, and KLY and PZH have strong positive correlations greater than 0.9. This means these stocks tend to move in the same direction; when one increases, the other also tends to increase, and vice versa. The stock pair CCGX and YLYY has the lowest correlation, around 0.16. This indicates a minimal linear relationship between the movements of these two stocks. The stocks CHKJ, FXYY, HLSW, JAYL, and YLYY have moderate to high positive correlations, suggesting that these stocks also tend to move together.
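A pairwise correlation matrix of this kind can be sketched as follows. The text does not state which series or correlation coefficient underlies Figure 3, so the use of the Pearson coefficient on closing prices is an assumption, and the tickers and prices below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant series has zero variance and makes the ratio undefined.
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Map every ordered pair of names to the correlation of their series."""
    names = list(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}


# Hypothetical closing prices, for illustration only (not data from the study).
prices = {
    "A": [10.0, 10.5, 11.0, 10.8, 11.5],
    "B": [20.0, 21.0, 22.0, 21.6, 23.0],  # exactly twice A, so they move together
    "C": [5.0, 4.9, 4.7, 4.8, 4.5],       # falls while A rises
}
corr = correlation_matrix(prices)
```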
Figure 4 showcases the distribution of closing prices for each stock. Most stocks show a roughly unimodal distribution, with a single peak in the closing prices. The stock CCGX shows a right-skewed distribution. This means that most closing prices are lower, but there are a few days with exceptionally high closing prices. The stock PZH also shows a right-skewed distribution, with most closing prices concentrated on the lower side. The stocks CHKJ, FXYY, HLSW, JAYL, and YLYY show relatively symmetric distributions, with most closing prices concentrated around the middle.
Figure 5 shows boxplots of the closing price for each stock. The stocks CCGX, KLY, and PZH show a wide range of closing prices, as indicated by the large boxes and long whiskers. This suggests significant volatility in these stocks. The stocks CHKJ, FXYY, PZH, and YLYY have several outliers on the high end, indicating days with exceptionally high closing prices. The stocks CHKJ, FXYY, HLSW, JAYL, and YLYY show relatively narrower ranges of closing prices, as indicated by smaller boxes and shorter whiskers. This suggests less volatility in these stocks. The median closing price (indicated by the line within the box) varies significantly among the stocks, with PZH having the highest median price and CHKJ having the lowest.
Figure 6 illustrates the sentiment score distributions for each stock. Each colored curve with its shaded region represents a stock's sentiment distribution. A peak in a curve indicates a concentration of data points, suggesting that the sentiment score at that point is more frequent. All stocks have sentiment scores primarily ranging between 0 and 4. Most stocks have a prominent peak around the sentiment score of 3. This suggests that positive sentiment is predominant across most of these stocks. Stocks like FXYY and CHKJ have sentiment distributions that are notably narrow and concentrated around the score of 3, suggesting a consistent positive sentiment for these stocks.
4.2 Methods for data processing and model training
4.2.1 Data processing
The datasets were subjected to a series of preprocessing steps to ensure their suitability for the subsequent modeling phase. Missing values, if any, were handled appropriately, and categorical variables were converted into numerical representations. The data were then standardized to bring all the features to a comparable scale. The dataset was split into training and testing subsets in a 9:1 ratio, with the time window size set to seven. This division allowed for robust training while reserving a held-out portion for evaluating the models on unseen data, providing a fair assessment of their generalization capabilities.
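The windowing and splitting step can be sketched as below. The text gives only the window size and the split ratio, so two details here are assumptions: each sample pairs seven consecutive days of features with the label of the following day, and the split is chronological rather than shuffled.

```python
def make_windows(features, labels, window=7):
    """Pair each run of `window` consecutive days with the label of the next day."""
    samples, targets = [], []
    for end in range(window, len(features)):
        samples.append(features[end - window:end])
        # Assumption: the prediction target is the day right after the window.
        targets.append(labels[end])
    return samples, targets


def chronological_split(samples, targets, train_parts=9, total_parts=10):
    """Keep the earliest 9/10 of samples for training and the rest for testing."""
    # Integer arithmetic avoids floating-point rounding in the cut position.
    cut = len(samples) * train_parts // total_parts
    return (samples[:cut], targets[:cut]), (samples[cut:], targets[cut:])


# Hypothetical toy series: one scalar feature per day and a binary trend label.
daily_features = [float(day) for day in range(27)]
daily_labels = [day % 2 for day in range(27)]
X, y = make_windows(daily_features, daily_labels)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

If this layout is used, standardization statistics would normally be computed on the training portion only, so that no information from the test period leaks into training.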
4.2.2 Model training
The LSTM + Transformer model was designed to capture both short-term and long-term dependencies in the data. The configuration is shown in Table 4. The model was trained using the Adam optimizer, with a learning rate of 0.003 over 100 epochs. A mean squared error (MSE) loss function was utilized, as given by the equation:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$
where n is the total number of observations, \({y}_{i}\) is the actual value for the i-th observation, and \({\widehat{y}}_{i}\) is the predicted value for the i-th observation. Because the errors are squared, this loss function penalizes large prediction errors more heavily than small ones, which encourages the model to avoid them.
The SVM, detailed in Table 5, was configured with a radial basis function kernel to enable nonlinear classification. The choice of kernel, along with the regularization parameter C, governs the trade-off between model complexity and misclassification, making the model flexible yet robust. The Random Forest algorithm leverages an ensemble of decision trees to improve on the performance of a single tree. Parameters such as the number of estimators, the maximum depth of the trees, and the criterion for splitting contribute to a model adept at capturing intricate data patterns without succumbing to overfitting. Table 6 shows the detailed configuration.
As shown in Table 7, a Gaussian Naïve Bayes model was implemented, assuming the Gaussian distribution of features. The simplicity and efficiency of Naïve Bayes make it a valuable addition to the ensemble of models tested, providing a distinct perspective on the dataset.
The methodologies described herein offer a comprehensive and well-founded approach to data processing and model training. Each model was chosen and configured to suit the specific data characteristics, aiming for robustness and interpretability. The insights gleaned from these various models contribute to a multifaceted understanding of the underlying patterns within the data, forming the foundation for informed decision-making and further investigation.
5 Performance comparison
When evaluating stock price trend prediction models, especially in the complex and ever-evolving pharmaceutical industry, it is essential to delve deeply into the results obtained from various models to decipher which provides the most reliable forecasting, as shown in Table 8.
5.1 Model-wise comparative analysis
The LSTM + Transformer model exhibited strong performance across several stocks, with exceptionally high accuracy for PZH and CHKJ. This model demonstrated a balanced approach, achieving good precision and recall scores, indicative of its ability to accurately predict and capture actual trend movements. The F1 scores for these stocks were also commendable, highlighting the model's effectiveness in balancing precision and recall. In contrast, the SVM model showed a broader range of performance. For instance, it achieved a moderate level of accuracy for CHKJ but lower scores for other stocks like FXYY. The variation in precision and recall across stocks suggests that the SVM model may struggle to consistently capture complex stock data patterns. The Random Forest model displayed notable consistency with specific stocks. For example, it performed well on CHKJ and KLY. However, its performance varied across other stocks, as seen with YLYY, where it achieved an accuracy of 0.6363. This indicates a potential for strong performance in specific contexts but a need for further refinement for broader applicability. The Naïve Bayes model showed its strength with certain stocks like CHKJ, achieving an accuracy of 0.5714, and performed reasonably well on other stocks like PZH. However, its performance was less consistent across the board, highlighting its potential in specific scenarios and its limitations in broader applications.
The LSTM + Transformer model stands out for its consistently high performance, particularly in stocks like CHKJ and HLSW. While showing competitive results in some instances, the SVM model requires more balance in its predictive capabilities. The Random Forest model is effective in specific contexts, particularly with stocks like CHKJ and KLY. The Naïve Bayes model, with notable successes in certain stocks, shows both utility and variability across different stocks.
5.2 Feature impact and data complexity
The strength of the LSTM + Transformer lies in its capacity to model sequential dependencies in stock data and attend to the most crucial features using its attention mechanism. Its performance indicates that stock data possesses intricate temporal relationships that benefit from complex models. Models like SVM, which try to find a separating hyperplane in the feature space, might struggle when features are multicollinear or the data are not cleanly separable. Random Forest leverages decision trees, which can model nonlinear relationships. Its varying performance might indicate that some stocks have patterns easily captured by tree splits, but others require more nuanced modeling. Naïve Bayes' performance reiterates that stock data can sometimes align with probabilistic distributions, but not always, given its variability across stocks.
The Kruskal–Wallis test yields a test statistic of 36.85 and a p-value of 0.000005, far less than 0.05, so the null hypothesis is rejected. This implies a significant difference in sentiment data distribution across the different stocks.
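For reference, the Kruskal–Wallis H statistic can be computed from pooled ranks as in the sketch below. This is a simplified illustration rather than the procedure used in the study: it averages the ranks of tied values but omits the tie correction factor, and it does not compute the p-value, which is obtained by comparing H with a chi-square distribution with k - 1 degrees of freedom for k groups. The sample values are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (tied values share an average rank, no tie correction)."""
    # Pool all observations, remembering which group each one came from.
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)

    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Ranks i+1 .. j are shared equally among the tied observations.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j

    weighted = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical sentiment scores for three stocks, for illustration only.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A larger H indicates that the groups' rank distributions differ more strongly.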
5.3 Results of the causality test
5.3.1 ADF test
The ADF test, a recognized method to ascertain the characteristics of a time series dataset, was employed on eight different stocks, with the results compiled in Table 9. This table offers some noteworthy revelations. First, all the ADF statistics display negative values with considerable absolute magnitudes, which can indicate stationary patterns within the time series data for each stock. Additionally, every p-value is significantly lower than the threshold of 0.05, providing robust evidence against the null hypothesis. Lastly, when analyzed at different significance levels (1%, 5%, 10%), the ADF statistics fall beneath the respective critical values, further substantiating the stationarity of these stock series.
5.3.2 Granger causality test
We conduct the Granger causality test on sentiment and stock trend with four lag orders (see Table 10). The null hypothesis is that sentiment does not Granger-cause the stock trend. For the stock CCGX, the p-values at the first three tested lag orders are less than 0.05, so we reject the null hypothesis at the 5% significance level. At the fourth lag order, the p-value is slightly above 0.05, so we cannot reject the null hypothesis at that level. This suggests that while sentiment can influence stock trends for CCGX at shorter lag periods, the effect diminishes over longer periods. The immediate significant results for CCGX imply a quick market response to sentiment changes, reflecting a more efficient market where information is quickly absorbed.
For the stock YLYY, at the third and fourth lag orders, the p-values are both below 0.05. This indicates that we reject the null hypothesis at these lag orders. This result was not observed at the first and second lag orders, indicating a delayed effect where sentiment accumulates and influences the stock trend after a certain period. The delayed significant results for YLYY imply a cumulative effect of sentiment, where the influence of sentiment information takes time to be reflected in the stock prices.
For the other stocks, the p-values at all tested lag orders are greater than 0.05. Hence, we cannot reject the null hypothesis, which indicates that sentiment does not Granger-cause the stock trend variations for these stocks. These findings highlight the complexity of the relationship between sentiment and stock trends. The significant results for CCGX and YLYY suggest that sentiment can influence stock trends under certain conditions and different lag periods, resulting in diverse effects over time.
Table 11 represents the results of the Granger causality test with a modified null hypothesis, which is that the stock trend is not a Granger cause for sentiment. For the stock CCGX, the p-values for lag orders 1, 2, and 3 are less than 0.05. Therefore, we reject the null hypothesis and infer that the stock trend is a Granger cause for sentiment changes in the case of CCGX. This indicates a bidirectional Granger causality between sentiment and stock trends for CCGX, where not only does sentiment affect stock trends, but stock trends also influence sentiment.
For the stock KLY, the p-value at lag order 4 is slightly below 0.05, indicating that we reject the null hypothesis at this lag order. This suggests that the stock trend does not immediately influence sentiment but may have a delayed influence on sentiment for the stock. This delayed response can be due to several factors, such as the time it takes for investors to process information, the gradual realization of market trends, or the accumulation of market signals before impacting sentiment.
For other stocks, the p-values for all lag orders are greater than 0.05, indicating that the stock trend does not Granger-cause sentiment changes for these stocks. The results indicate that during periods marked by positive investor sentiment, investors' predictive capacity concerning stock prices and economic conditions tends to improve, leading to a proclivity for stock acquisition. This inclination results in an escalation of trade volume and an elevation of returns, ultimately propelling an upward trajectory for the stock market. Conversely, during times characterized by negative sentiment, the capacity of investors to forecast future stock prices is notably impaired. Under these circumstances, investors tend to short the market, triggering a downturn in the overall stock market trend. The results for the remaining stocks examined were insignificant, failing to reject the null hypothesis. This suggests no discernible Granger causality exists between these stocks' sentiment and trend.
Table 12 shows the Granger causality test on the COVID indicator (CI) and stock trend with four lag orders. For most stocks, the CI does not appear to be a Granger cause of stock trend variation at any of the lags tested. The evidence therefore does not indicate predictive causality between these stocks' trends and the CI. Only in the case of JAYL, at the third and fourth lags, is there evidence to reject the null hypothesis. This suggests that past values of the CI could predict the stock trend of JAYL with a lag of three or four periods, which may be due to JAYL's heightened sensitivity to COVID-related changes or investor perceptions. However, causation is not confirmed; it is merely a predictive pattern. The specific business model and sector of JAYL, the dissemination of COVID-related information, or latent variables might also influence this observed relationship. Further comprehensive analysis, understanding JAYL's operations, and exploring other statistical methods are crucial before drawing definitive conclusions.
The Granger causality tests reveal several key insights into the relationships between sentiment, CI, and stock trends in the pharmaceutical sector. Overall, we found significant bidirectional Granger causality for CCGX, where sentiment quickly influences stock trends and vice versa, indicating an efficient market response to sentiment changes. The analysis reveals that YLYY exhibited a lagged response of stock market trends to market sentiment over extended durations, suggesting a cumulative effect where the influence of sentiment information on stock prices unfolds gradually. KLY's results underscore a calculated behavioral pattern among investors, with sentiment adjustments following stock market trends after a significant delay. This pattern highlights the strategic nature of investment decisions in response to evolving market trends. JAYL exhibited sensitivity to COVID-related changes, showing significant Granger causality from the CI to stock trends at the later lags, indicating a potential delayed response to COVID information. For other stocks, the tests did not show significant Granger causality, indicating that sentiment and COVID indicators do not predict stock trends for these stocks. This lack of significant causality implies that different stocks have varying sensitivities to sentiment and external indicators, which could be influenced by their specific market dynamics and investor behavior.
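The logic behind each Granger test can be sketched as an F-test that compares a regression of one series on its own lags with a regression that also includes lags of the other series. The code below is a minimal illustration under stated assumptions, not the procedure used to produce Tables 10 to 12: it uses ordinary least squares via the normal equations, reports only the F-statistic (the p-value would come from an F distribution with `lags` and `dof` degrees of freedom), and runs on simulated data.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit, via the normal equations."""
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is non-singular).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, targets))


def granger_f(y, x, lags):
    """F-statistic for H0: lags of x add no predictive power for y beyond lags of y."""
    restricted, unrestricted, targets = [], [], []
    for t in range(lags, len(y)):
        own = [y[t - i] for i in range(1, lags + 1)]
        other = [x[t - i] for i in range(1, lags + 1)]
        restricted.append([1.0] + own)
        unrestricted.append([1.0] + own + other)
        targets.append(y[t])
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Simulated data (an assumption for illustration): y follows x with a one-period delay.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(300)]
y = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, 300)]
f_x_to_y = granger_f(y, x, lags=2)  # large: past x helps predict y
f_y_to_x = granger_f(x, y, lags=2)  # small: past y does not help predict x
```

Running the test in both directions, as in this sketch, is what distinguishes a one-way relationship from the bidirectional pattern reported for CCGX.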
5.3.3 Impulse response analysis
We conduct impulse response analysis for CCGX concerning the sentiment and stock trend to verify the relationship between investor sentiment and stock trend. Figure 7 depicts four subplots of the results of the impulse response. We only need to focus on the response results of the stock trend to sentiment and sentiment to stock trend.
In the "Response of Stock Trend to Sentiment" subplot, when changes in sentiment impact the stock trend, there is a slightly expanding positive reaction in the first period, which declines gradually after the second period. The response tends toward zero in subsequent periods as its amplitude gradually decreases. This indicates that a change in sentiment positively impacts the stock trend and that the impact decreases gradually. The effect in the first period is significant, and it remains substantial after the second period. This shows that changes in sentiment have a lasting positive impact on stock trends, but the magnitude of this effect diminishes gradually, aligning with the Granger causality test results that identified sentiment as a Granger cause for the stock trend in CCGX. It suggests that the real-time nature of investor sentiment leads investors to behave irrationally. Investors follow their bullish or bearish sentiment to increase or reduce their positions, resulting in stock trend changes.
The "Response of Sentiment to Stock Trend" subplot shows that when the stock trend is given a positive shock of one standard deviation, it causes a rapid positive response in sentiment. The response then weakens and tends to zero by the fourth period. This indicates that stock trends can positively impact sentiment in the short term, corroborating the Granger causality test results that identified stock trends as a Granger cause for sentiment in CCGX. This result implies that stock trends can influence investor sentiment, which may affect subsequent market behavior.
Figure 8 depicts the result of the impulse response focusing on the response of stock trend to sentiment for YLYY. We observe that when sentiment is given a positive impact of one standard deviation unit, it causes an initial positive impact on the stock trend in the first period and then gradually declines, tending towards zero by the seventh period. The result aligns with the Granger causality findings, where impacts were observed at longer lags. The delayed yet positive impact of sentiment on stock trends suggests that sentiment information accumulates over time and eventually influences the trends.
The impulse response analysis for KLY explores the relationship between investor sentiment and stock trends. Figure 9 presents the response of sentiment to changes in stock trends. A positive impact on the stock trend leads to a slightly positive response in sentiment. The response remains slightly positive until the sixth period. This delayed response implies that investors might require a consistent observation of stock trends over several periods before adjusting their sentiment. This behavior reflects a more strategic approach, where investors wait to confirm trends before changing their sentiment. The initial significant response and its gradual decline highlight the transient nature of this impact, with sentiment adjustments occurring over time.
The impulse response analysis for JAYL delved deeper into the intricate relationship between COVID severity indicators and stock trends. As depicted in Fig. 10, a noteworthy pattern emerges when variations in COVID severity impact the stock trends. An initial positive momentum is observed, expanding progressively across the first five periods. This surge begins to wane after the fifth period, showing a subtle decline and eventually stabilizing beyond the tenth period. This trajectory suggests that JAYL's stock trends are not merely responsive to shifts in COVID severity but are positively amplified during the initial phases. Despite the subsequent attenuation, this amplification does not vanish entirely but remains significant, even past the sixth period. Such persistent effects underscore the longer-term influence of COVID-19 severity changes on JAYL stock performance. Furthermore, the sustained magnitude of this effect implies that its impact is unlikely to diminish in the near term, hinting at a durable linkage between pandemic severity and JAYL's stock trajectory.
The impulse response analyses for CCGX, YLYY, KLY, and JAYL reveal key insights into the relationships between sentiment, stock trends, and external factors. For CCGX, sentiment and stock trends exhibit a bidirectional influence with significant short-term impacts. YLYY shows a delayed response of stock trends to sentiment, indicating an accumulation effect over time. KLY's results highlight a strategic investor behavior with sentiment reacting to stock trends at a longer lag. JAYL's analysis underscores the persistent influence of COVID-19 severity on stock performance, suggesting a durable linkage between pandemic severity and market behavior. These findings emphasize the importance of considering immediate and delayed effects in market analysis and strategy development.
5.4 Word cloud
Figures 11 and 12 show the word clouds of CCGX comments. The dominant terms are "Changchun High-tech," "equity," "limit-up," and "somatropin" (a growth hormone). These words suggest that the main topics around CCGX are likely related to the company's core business and products. "Somatropin," a growth hormone, indicates a focus on one of their major products or research areas. "Equity" suggests conversations may also be centered around stock ownership, capital structure, or equity-based transactions. The sentiment indicators "advance," "growth," "limit-up," and "buy-in" suggest a generally positive sentiment around CCGX. Terms like "procurement" and "growth" denote positive movement, while "limit-up" indicates rising stock prices, and "buy-in" implies positive investment advice or behavior.
Terms like "northbound funds," "profit," "100 million yuan," and "9.5 billion yuan" suggest discussions around financial performance, investment, market capitalization, or revenue numbers. Unexpected words such as "shareholders reducing holdings," "abuse," "lower-limit," "information disclosure," "fake news," "short-selling event," and "recombinant" point to past or ongoing controversies or adverse events. These issues could potentially impact the stock's performance or investor sentiment, and they may warrant further investigation to understand their context and potential impact. The word cloud reveals a generally positive sentiment towards CCGX, with prominent themes focusing on equity, core products, and financial performance. However, it also hints at some controversies or challenges that might affect the stock's performance.
Figures 13 and 14 display word clouds generated from news comments related to YLYY. Key insights can be derived from the prominent terms in these visualizations. Dominant terms include "Yiling Pharmaceutical," "Lianhua Qingwen," "capsule," and "registration," indicating that discussions primarily focus on the company's core products and regulatory affairs. The prominence of "Lianhua Qingwen," a traditional Chinese medicine widely recognized for its use in treating viral infections, reflects a strong emphasis on this pharmaceutical product. "Capsule" and "tablets" indicate discussions about various forms of this medication. The term "registration" implies frequent mentions of regulatory approvals and certifications, which are critical for pharmaceutical companies. Positive sentiment is reflected in words such as "obtain," "approval," "certificate," and "profit," suggesting an overall optimistic outlook on YLYY's operations and financial performance. Terms like "FDA," "subsidiary," and "RMB" denote positive financial movements and regulatory achievements, indicating successful business expansion and financial health.
Figures 15 and 16 display word clouds generated from news comments related to KLY. Dominant terms include "Asymchem Lab," "pharma," "share," and "plan," indicating that discussions are primarily centered around the company's shares, laboratory activities, and strategic plans. The terms "share" and "plan" suggest frequent discussions about stock performance and strategic initiatives. Terms like "CRO," "investment," "IPO," and "project" denote positive business activities and growth opportunities, indicating successful expansions and investments. Additionally, terms like "Frontier Biotech" suggest a focus on innovative research and development within the company. However, terms such as "concern" and "limit down" indicate potential regulatory challenges or market volatility the company might face. Terms like "intense institutional game" suggest competitive dynamics among institutional investors, while "revision of the company's non-public issuance plan" points to strategic financial decisions that could impact investor sentiment.
Figures 17 and 18 showcase the word clouds for JAYL. "Andon Health" emerges as the dominant term in the word clouds, solidifying its stature as a central topic of discussion. The prominence of such a term usually indicates significant developments or movements surrounding the entity, suggesting that any news or events related to Andon Health could have overarching impacts on related narratives or market sentiments. Within the associated terms, we notice the significant emphasis on words that connect to stock market movements, such as "Touched the Limit Up" and its English counterparts "test" and "limit." This implies heightened volatility or potential speculative interest in Andon Health's stock. The co-occurrence of terms like "Speculative Stock" and "Rise to the Limit Again" further cements this narrative, suggesting that the stock might have been on a roller-coaster ride, fluctuating between bullish enthusiasm and potential bearish retractions. Sentiment indicators, however, present a mixed picture. While terms like "COVID," "test," and "kit" offer a positive-skewed lens, hinting at a business model or product line catering to the pandemic response, other terms infuse a tone of caution. Phrases like "Net Profit Forecast Decrease" or the simple "net" in the English word cloud imply some challenges on the financial front.
Moreover, words like "Key Monitoring" and the "issued regulatory letter" shed light on potential regulatory scrutiny or interventions the company might have experienced or is experiencing. The unexpected words add another layer to our analysis. "Omicron," a COVID-19 variant, insinuates a dynamic landscape where Andon Health may adapt or respond to evolving health challenges. On the other hand, terms like "rumor" or "Interest Transfer" hint at potential unverified information or even controversies surrounding the firm. In essence, the narratives surrounding Andon Health, as captured by the word clouds, depict a company at the intersection of opportunity and challenge. It is navigating the complexities of the pandemic market and regulatory ecosystems and possibly even battling rumors while striving to maintain its business prominence and financial stability.
6 Discussion
6.1 Exploration of multimodal data (RQ1)
In addressing our first research question, we have endeavored to understand the capacity of fused multimodal stock market data to predict short-term stock price trends within the pharmaceutical industry. This study's results present compelling evidence that such multimodal information, which amalgamates traditional stock figures, nuanced textual data, and real-time COVID-related data, can predict stock trends, indicating whether they are more likely to follow a bullish or bearish trajectory. Our results corroborate existing literature on the efficacy of multimodal information [19, 50], highlighting its superiority over unimodal data sources.
In our approach, we employed advanced text classification and sentiment analysis techniques. These techniques allowed us to delve deeper into investor comments gleaned from various online forums, effectively translating raw, often chaotic, text data into structured, analyzable insights. We have then used these insights to construct corresponding indicators, which are integral to our prediction models.
We conducted a thorough comparative analysis to determine the most effective classification model for our diverse dataset and to ensure that we cater to various stock types and market scenarios. This analysis pitted several established classifiers, namely SVM, Random Forest, Naïve Bayes, and a hybrid model comprising LSTM and Transformer, against each other. Our findings revealed that the LSTM and Transformer blend provides the most balanced results. This result is consistent with Ho et al. [12], Moghar and Hamiche [27], and Yang and Wang [52], who found LSTM and Transformer effective in sequential prediction tasks, but differs from Nabipour et al. [29], who found SVM superior in stock trend prediction. The discrepancy could stem from multiple factors, such as varying datasets, chosen indicators, and other nuanced methodological differences. Interestingly, however, we observed that for certain specific stocks, like PZH, the Naïve Bayes model was of greater pertinence.
6.2 Sentiment and COVID indicator's impact (RQ2)
In response to the second research question, we have endeavored to evaluate the extent of the impact that financial textual and COVID-related data have on stock price trends within the pharmaceutical industry. Building on the premise set by Baker and Wurgler [4], Okunev and White [32], and Thomsett [43], we expanded the set of indicators, encompassing not just stock market trend and intraday trading indicators but also sentiment analysis based on investor commentaries, COVID-related indicators, and additional variables extracted from our original dataset. We used these indicators to study the intricate interplays between investor sentiment, COVID severity, and stock trends. To this end, we employed Granger causality and impulse response tests, recognized tools for studying the cause-effect relationships between various factors, which align with methodologies from Lazzini et al. [18] and Mudinas et al. [28], who utilize the Granger causality test to analyze causal relationships. Our findings led us to a fascinating observation: investor sentiment towards CCGX and its price trends were identified as Granger causes of each other. This mutually causal relationship suggests that any impact on investor sentiment or stock trends influences the other, indicating a cyclical, interconnected relationship echoing findings from Hu et al. [15].
The analysis of YLYY revealed a delayed but significant impact of investor sentiment on stock trends. This suggests that investor sentiment accumulates over a period before it is reflected in stock valuations, underscoring the importance of accounting for both immediate and delayed effects in market analysis. In contrast, for KLY, stock trends were found to be a Granger cause of investor sentiment with a longer lag period. This finding suggests a lagged response mechanism, in which investors may need to observe consistent trend patterns over multiple periods before adjusting their sentiment. Such a pattern points to strategic investor behavior and broader market dynamics. In addition, we found that the severity of the COVID pandemic, as measured by our COVID indicator, was a Granger cause of changes in JAYL stock trends. This result shows that external factors, such as a global pandemic, can exert a significant impact on stock trends.
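A delayed effect of this kind can be visualized with an impulse response function. The sketch below assumes a two-variable VAR(1) system, z_t = A z_{t-1} + e_t with z = (sentiment, trend), and traces the non-orthogonalized response to a unit shock, which at horizon h is the matrix power A^h. The coefficient matrix, the function name `impulse_responses`, and the choice of a VAR(1) are hypothetical and are not estimates from our data.

```python
import numpy as np


def impulse_responses(coef, horizons):
    """Return an array R where R[h][i, j] is the response of variable i
    at horizon h to a unit shock in variable j, for a VAR(1) system."""
    coef = np.asarray(coef, dtype=float)
    k = coef.shape[0]
    responses = np.empty((horizons + 1, k, k))
    responses[0] = np.eye(k)  # the shock itself at horizon 0
    for h in range(1, horizons + 1):
        responses[h] = coef @ responses[h - 1]  # A^h = A @ A^(h-1)
    return responses


# Hypothetical coefficients: index 0 = sentiment, index 1 = trend.
# The trend depends on lagged sentiment (0.4); sentiment ignores the trend.
A = [[0.5, 0.0],
     [0.4, 0.3]]
irf = impulse_responses(A, horizons=5)

# Response of the trend to a one-off unit shock in sentiment.
trend_response = irf[:, 1, 0]
print(np.round(trend_response, 3))
```

With these assumed coefficients the trend does not react at horizon 0, responds from horizon 1 onward, and then decays, which is the kind of lagged pattern described above.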
6.3 Word cloud analysis
In this study, we adopted word cloud analysis to further investigate investor sentiment and associated challenges for four significant stocks: CCGX, YLYY, KLY, and JAYL. This form of qualitative visualization provided a layered view of investor sentiment, operational details, and the potential market challenges these companies face. In the case of CCGX, prominent terms highlighted the company's primary areas of operation, emphasizing its involvement with growth hormones, equity dynamics, and bullish stock trajectories. Nevertheless, a subtle undercurrent of potential controversies emerged from the cloud, indicating a need for a deeper probe into how these factors might affect stock performance. The word cloud for YLYY highlighted a focus on regulatory approvals and its core product, Lianhua Qingwen, indicating strong financial performance. Yet terms like "limit" and "concern" suggest regulatory challenges and market volatility that could affect its growth. For KLY, the analysis revealed strategic plans and financial achievements while pointing to significant market volatility and competitive pressures. The word cloud for JAYL primarily underscored the company's engagement with the unfolding COVID-19 pandemic. The investor discourse seemed to hover between bullish optimism, driven by the company's proactive response to the pandemic, and caution, driven by potential financial challenges and regulatory interventions. Unexpected terms emerging in the word clouds suggested a turbulent market landscape shaped by evolving health challenges and possible misinformation.
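A word cloud is a visualization of term frequencies, so the counts behind it can be sketched in a few lines. The example below is a minimal sketch under stated assumptions: the comments are invented English sentences, the stopword list is hypothetical, and tokenization is a simple regular expression. Investor comments written in Chinese would require a word segmentation step first, and the preprocessing shown here is not a description of the one used in our study.

```python
import re
from collections import Counter

# Hypothetical stopword list for the toy example.
STOPWORDS = {"the", "a", "is", "on", "and", "of", "to", "for", "in"}


def term_frequencies(comments, stopwords=STOPWORDS):
    """Count lowercase alphabetic tokens across comments, skipping stopwords."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z]+", comment.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts


# Invented comments for illustration only.
comments = [
    "Approval news is bullish for the stock",
    "Concern on the price limit after approval",
    "Bullish on approval and strong earnings",
]
freqs = term_frequencies(comments)
print(freqs.most_common(2))
```

The most frequent terms are the ones a word cloud renders largest; less frequent terms such as "limit" and "concern" still appear and can flag secondary themes.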
6.4 Limitations
While our research provides useful insights, we also acknowledge its limitations. First, our study employs established machine learning models, so the innovation in model algorithms is limited. Second, our focus was on a select few stocks, specifically CCGX, YLYY, KLY, and JAYL, which may constrain the generalizability of our findings. The predictive power of sentiment and COVID indicators might vary across different stocks or sectors; therefore, our results should be interpreted within the pharmaceutical context. Third, our sentiment analysis relied heavily on the quality and representativeness of our chosen news data, so employing a different dataset or sentiment analysis technique might yield different results. Additionally, the time frame of this research and external factors, such as policy changes, economic conditions, and unforeseen events, could influence the results.
6.5 Implications
Our research has implications for both theoretical development and practical application in stock market prediction. Theoretically, this research underscores the utility of multimodal information in short-term stock price trend prediction, contributing a novel perspective to the literature on financial forecasting. It sheds light on the interplay of diverse data sources, such as financial textual and pandemic-related information, in shaping stock price movements. Our findings, which suggest that investor sentiment and the severity of the COVID-19 pandemic are Granger causes of changes in specific stock trends, offer valuable insights for further theoretical exploration. The study contributes to a nuanced understanding of the dynamics of China's stock market, opening new avenues for subsequent academic inquiry.
From a practical perspective, this research can guide investors and market regulators in their decision-making process. Our findings give them a robust foundation for predicting stock trends, particularly within the volatile pharmaceutical industry. The insights into the effect of investor sentiment and pandemic severity on stock trends could also support more effective management of stock commentary and public perception. Furthermore, the findings can aid in creating more sophisticated stock prediction models, thereby enhancing investment strategies and regulatory practices.
7 Conclusion
In this research, we embarked on a comprehensive exploration to uncover the potential of multimodal market information in predicting short-term stock price trends, particularly in the pharmaceutical industry. Our analysis investigates the efficacy of fused data sources in forecasting stock trends and the influence of financial textual and COVID-related information on these trends.
The findings suggest that multimodal information, particularly when enriched with textual sentiment analysis and COVID indicators, can be highly effective in predicting stock trends. We found substantial correlations, particularly in the sentiment analysis, which revealed interdependencies between investor sentiment and stock trends. The choice of classifier, especially the LSTM and Transformer hybrid, also emerged as a crucial factor in the prediction process.
Despite the research's limitations regarding the scope of stocks analyzed and the dependence on the quality of news data, our findings provide crucial insights and represent a significant stride in enhancing the sophistication of financial prediction models with information fusion. The results of this study offer tangible benefits to investors, regulators, and future research endeavors, leading to more rational investment decisions, effective management of stock commentary, and a richer understanding of the complex dynamics at play in the stock market.
This research prompts more in-depth investigations into using multimodal information for financial forecasting, with potential expansion to other sectors beyond the pharmaceutical industry. Exploring different data types and machine learning algorithms and integrating further real-world indicators could be beneficial. Our research provides a solid basis for further innovation and refinement in the quest for an optimal financial prediction model with various information fusions.
Future research could explore the development and application of more advanced and novel prediction models to enhance predictive performance and capture more complex relationships. Moreover, the analysis should be broadened to include more stocks and test alternative sentiment analysis techniques. Exploring different market conditions and external factors, along with expanding the dataset to cover diverse periods and additional sentiment indicators, can enhance the robustness of the findings. This approach will provide a deeper understanding of how sentiment influences stock market behavior, especially in the pharmaceutical sector.
References
Alenazi FS, El Hindi K, AsSadhan B (2023) Complement-class harmonized naïve bayes classifier. Appl Sci 13(8):4852. https://doi.org/10.3390/app13084852
Ampountolas A (2023) The effect of COVID-19 on cryptocurrencies and the stock market volatility: a two-stage DCC-EGARCH model analysis. J Risk Financ Manag 16(1):25. https://doi.org/10.3390/jrfm16010025
Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv.org. https://doi.org/10.48550/arxiv.1803.01271
Baker M, Wurgler J (2007) Investor sentiment in the stock market. J Econ Perspect 21(2):129–151. https://doi.org/10.1257/jep.21.2.129
Caballero RJ, Simsek A (2021) A model of endogenous risk intolerance and LSAPs: asset prices and aggregate demand in a “COVID-19” shock. Rev Financ Stud 34(11):5522–5580. https://doi.org/10.1093/rfs/hhab036
Ding Q, Wu S, Sun H, Guo J, Guo J (2021) Hierarchical multi-scale Gaussian transformer for stock movement prediction. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (vol 7, pp 4640–4646). https://doi.org/10.24963/ijcai.2020/640
Ganie IR, Wani TA, Yadav MP (2022) Impact of COVID-19 outbreak on the stock market: an evidence from select economies. Bus Perspect Res 0(0):1–15. https://doi.org/10.1177/22785337211073635
Gao M, Feng C (2022) An improved ARIMA stock price forecasting method based on B-spline expansion and model averaging. Acad J Comput Inf Sci 5(10):14–20. https://doi.org/10.25236/AJCIS.2022.051003
Ghasemieh A, Kashef R (2023) An enhanced Wasserstein generative adversarial network with Gramian Angular Fields for efficient stock market prediction during market crash periods. Appl Intell 53(23):28479–28500. https://doi.org/10.1007/s10489-023-05016-2
Han H, Xie L, Chen S, Xu H (2023) Stock trend prediction based on industry relationships driven hypergraph attention networks. Appl Intell 53(23):29448–29464. https://doi.org/10.1007/s10489-023-05035-z
Haroon O, Rizvi SAR (2020) COVID-19: Media coverage and financial markets behavior—A sectoral inquiry. J Behav Exp Financ 27:100343–100343. https://doi.org/10.1016/j.jbef.2020.100343
Ho MK, Darman H, Musa S (2021) Stock price prediction using ARIMA, neural network and LSTM models. J Phys: Conf Ser 1988(1):12041–12051. https://doi.org/10.1088/1742-6596/1988/1/012041
Ho T-T, Huang Y (2021) Stock price movement prediction using sentiment analysis and CandleStick chart representation. Sensors 21(23):7957. https://www.mdpi.com/1424-8220/21/23/7957
Hu J, Jiang GJ, Pan G (2020) Market reactions to central bank interest rate changes: evidence from the Chinese stock market. Asia Pac J Financ Stud 49(5):803–831. https://doi.org/10.1111/ajfs.12316
Hu J, Sui Y, Ma F (2021) The measurement method of investor sentiment and its relationship with stock market. Comput Intell Neurosci 2021:6672677. https://doi.org/10.1155/2021/6672677
Huang P-S, Paulino YC, So S, Chiu DK, Ho KK (2022) Guest editorial: COVID-19 pandemic and health informatics Part 2. Library Hi Tech 40(2):281–285
Huang P-S, Paulino YC, So S, Chiu DK, Ho KK (2023) Guest editorial: COVID-19 pandemic and health informatics part 3. Library Hi Tech 41(1):1–6
Lazzini A, Lazzini S, Balluchi F, Mazza M (2022) Emotions, moods and hyperreality: social media and the stock market during the first phase of COVID-19 pandemic. Account Audit Accountability J 35(1):199–215. https://doi.org/10.1108/AAAJ-08-2020-4786
Lee T-W, Teisseyre P, Lee J (2023) Effective exploitation of macroeconomic indicators for stock direction classification using the multimodal fusion transformer. IEEE Access 11:10275–10287. https://doi.org/10.1109/ACCESS.2023.3240422
Li X, Wu P, Wang W (2020) Incorporating stock prices and news sentiments for stock market prediction: a case of Hong Kong. Inf Process Manag 57(5):102212. https://doi.org/10.1016/j.ipm.2020.102212
Li Y, Zhuang X, Wang J, Dong Z (2021) Analysis of the impact of COVID-19 pandemic on G20 stock markets. North Am J Econ Finan 58:101530–101530. https://doi.org/10.1016/j.najef.2021.101530
Lin Z, Feng M, Santos CNd, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. 5th International Conference on Learning Representations, Toulon, France. https://doi.org/10.48550/arXiv.1703.03130
Liu J, Li X, Wei Q, Liu S, Liu Z, Wang J (2023) A two-phase random forest with differential privacy. Appl Intell 53(10):13037–13051. https://doi.org/10.1007/s10489-022-04119-6
Liu Z, Huynh TLD, Dai P-F (2021) The impact of COVID-19 on the stock market crash risk in China. Res Int Bus Financ 57:101419–101419. https://doi.org/10.1016/j.ribaf.2021.101419
Mathieu E, Ritchie H, Rodés-Guirao L, Appel C, Giattino C, Hasell J, Macdonald B, Dattani S, Beltekian D, Ortiz-Ospina E, Roser M (2020) Coronavirus Pandemic (COVID-19). OurWorldInData.org. Retrieved July 7, 2023 from https://ourworldindata.org/coronavirus
Meng Y, Chu MY, Chiu DK (2023) The impact of COVID-19 on museums in the digital era: practices and challenges in Hong Kong. Library Hi Tech 41(1):130–151
Moghar A, Hamiche M (2020) Stock market prediction using LSTM recurrent neural network. Procedia Comput Sci 170:1168–1173. https://doi.org/10.1016/j.procs.2020.03.049
Mudinas A, Zhang D, Levene M (2019) Market trend prediction using sentiment analysis: lessons learned and paths forward. arXiv.org. https://doi.org/10.48550/arxiv.1903.05440
Nabipour M, Nayyeri P, Jabani H, S S, Mosavi A (2020) Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis. IEEE Access 8:150199–150212. https://doi.org/10.1109/ACCESS.2020.3015966
Niu H, Xu K, Wang W (2020) A hybrid stock price index forecasting model based on variational mode decomposition and LSTM network. Appl Intell 50(12):4296–4309. https://doi.org/10.1007/s10489-020-01814-0
Nti IK, Adekoya AF, Weyori BA (2020) Predicting stock market price movement using sentiment analysis: Evidence from Ghana. Appl Comput Syst (Online) 25(1):33–42. https://doi.org/10.2478/acss-2020-0004
Okunev J, White D (2003) Do momentum-based strategies still work in foreign currency markets? J Financ Quant Anal 38(2):425–447. https://doi.org/10.2307/4126758
Prasetijo AB, Isnanto RR, Eridani D, Soetrisno YAA, Arfan M, Sofwan A (2017) Hoax detection system on Indonesian news sites based on text classification using SVM and SGD. 2017 4th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE) (pp. 45–49), Semarang, Indonesia. https://doi.org/10.1109/ICITACEE.2017.8257673
Qian F, Chen X (2019) Stock prediction based on LSTM under different stability. 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). https://doi.org/10.1109/ICCCBDA.2019.8725709
Shankar K, Lakshmanaprabu SK, Gupta D, Maseleno A, de Albuquerque VHC (2020) Optimal feature-based multi-kernel SVM approach for thyroid disease classification. J Supercomput 76(2):1128. https://doi.org/10.1007/s11227-018-2469-4
Sharma GD, Tiwari AK, Jain M, Yadav A, Erkut B (2021) Unconditional and conditional analysis between covid-19 cases, temperature, exchange rate and stock markets using wavelet coherence and wavelet partial coherence approaches. Heliyon 7(2):e06181–e06181. https://doi.org/10.1016/j.heliyon.2021.e06181
Sherstinsky A (2020) Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D 404:132306. https://doi.org/10.1016/j.physd.2019.132306
Song Y, Lee JW, Lee J (2019) A study on novel filtering and relationship between input-features and target-vectors in a deep learning model for stock price prediction. Appl Intell 49(3):897–911. https://doi.org/10.1007/s10489-018-1308-x
Sun J, Zeng Z, Li T, Sun S (2024) Analyzing the spatiotemporal coupling relationship between public opinion and the epidemic during COVID-19. Library Hi Tech 42(6):1880–1904. https://doi.org/10.1108/LHT-10-2022-0462
Swathi T, Kasiviswanath N, Rao AA (2022) An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis. Appl Intell 52(12):13675–13688. https://doi.org/10.1007/s10489-022-03175-2
Tay Y, Dehghani M, Bahri D, Metzler D (2023) Efficient transformers: a survey. ACM Comput Surv 55(6):1–28. https://doi.org/10.1145/3530811
Thakkar A, Chaudhari K (2021) Fusion in stock market prediction: A decade survey on the necessity, recent developments, and potential future directions. Inf Fusion 65:95–107. https://doi.org/10.1016/j.inffus.2020.08.019
Thomsett MC (2011) Trading with candlesticks: visual tools for improved technical analysis and timing, vol 26. Ringgold, Inc., Portland
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA. https://dl.acm.org/doi/10.5555/3295222.3295349
Wang C-H, Yuan J, Zeng Y, Lin S (2024) A deep learning integrated framework for predicting stock index price and fluctuation via singular spectrum analysis and particle swarm optimization. Appl Intell 54(2):1770–1797. https://doi.org/10.1007/s10489-024-05271-x
Wang Y, Guo Y (2020) Forecasting method of stock market volatility in time series data based on mixed model of ARIMA and XGBoost. China Commun 17(3):205–221. https://doi.org/10.23919/JCC.2020.03.017
Wang Y, Xia S-T, Tang Q, Wu J, Zhu X (2018) A novel consistent random forest framework: Bernoulli random forests. IEEE Trans Neural Netw Learn Syst 29(8):3510–3523. https://doi.org/10.1109/TNNLS.2017.2729778
WHO (2020) WHO Director-General's Opening Remarks at the Media Briefing on COVID-19 - 11 March 2020 [announcement]. SyndiGate Media Inc. Retrieved July 3, 2023 from https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020
Xiang Z-L, Wang R, Yu X-R, Li B, Yu Y (2023) Experimental analysis of similarity measurements for multivariate time series and its application to the stock market. Appl Intell 53(21):25450–25466. https://doi.org/10.1007/s10489-023-04874-0
Xu Y, Cohen SB (2018) Stock movement prediction from tweets and historical prices. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. https://doi.org/10.18653/v1/P18-1183
Yang J, Yang C (2021) Economic policy uncertainty, COVID-19 lockdown, and firm-level volatility: Evidence from China. Pac Basin Financ J 68:101597–101597. https://doi.org/10.1016/j.pacfin.2021.101597
Yang M, Wang J (2022) Adaptability of financial time series prediction based on BiLSTM. Procedia Comput Sci 199:18–25. https://doi.org/10.1016/j.procs.2022.01.003
Yang Y, Fan C, Xiong H (2022) A novel general-purpose hybrid model for time series forecasting. Appl Intell 52(2):2212–2223. https://doi.org/10.1007/s10489-021-02442-y
Yu P, Yan X (2020) Stock price prediction based on deep neural networks. Neural Comput Appl 32(6):1609–1628. https://doi.org/10.1007/s00521-019-04212-x
Yu PY, Lam ETH, Chiu DK (2022) Operation management of academic libraries in Hong Kong under COVID-19. Library Hi Tech 41(1):108–129
Zhang F, Narayan PK, Devpura N (2021) Has COVID-19 changed the stock return-oil price predictability pattern? Financ Innov (Heidelberg) 7(1):61–61. https://doi.org/10.1186/s40854-021-00277-7
Zhang J, Teng Y-F, Chen W (2019) Support vector regression with modified firefly algorithm for stock price forecasting. Appl Intell 49(5):1658–1674. https://doi.org/10.1007/s10489-018-1351-7
Zoungrana TD, Toé DLt, Toé M (2023) Covid-19 outbreak and stocks return on the West African Economic and Monetary Union’s stock market: An empirical analysis of the relationship through the event study approach. Int J Financ Econ 28(2):1404–1422. https://doi.org/10.1002/ijfe.2484
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, H., Xie, Z., Chiu, D.K.W. et al. Multimodal market information fusion for stock price trend prediction in the pharmaceutical sector. Appl Intell 55, 77 (2025). https://doi.org/10.1007/s10489-024-05894-0
DOI: https://doi.org/10.1007/s10489-024-05894-0