Knowledge-Based Systems

Volume 189, 15 February 2020, 105151

Hybrid neural conditional random fields for multi-view sequence labeling

https://doi.org/10.1016/j.knosys.2019.105151

Highlights

  • We propose a hybrid neural CRF for multi-view sequence labeling, called MVCRF.

  • Our model incorporates multi-view learning by utilizing the consensus and complementary principles.

  • We systematically compare the performance of MVCRF with other models.

  • The experimental results show MVCRF achieves state-of-the-art performance.

Abstract

In traditional machine learning, the conditional random field (CRF) is the mainstream probabilistic model for sequence labeling problems. CRF considers the relations between adjacent labels rather than decoding each label independently, so better performance can be expected. However, few multi-view learning methods involving CRF can be directly used for sequence labeling tasks. In this paper, we propose a novel multi-view CRF model for labeling sequential data, called MVCRF, which exploits the two principles of multi-view learning: consensus and complementarity. We first use different neural networks to extract features from multiple views. Then, considering the consistency among the different views, we introduce a joint representation space for the extracted features and minimize the distance between the two views for regularization. Meanwhile, following the complementary principle, the features of the multiple views are integrated into the CRF framework. We train MVCRF in an end-to-end fashion and evaluate it on two benchmark data sets. The experimental results illustrate that MVCRF obtains state-of-the-art performance: an F1 score of 95.44% for chunking on CoNLL-2000, and 95.06% for chunking and 96.99% for named entity recognition (NER) on CoNLL-2003.

Introduction

Sequence labeling problems such as named entity recognition (NER) and syntactic chunking are classical tasks in the field of natural language processing (NLP). The sequential data used in these tasks mostly contain different features, such as word and part-of-speech (POS) features, which may be obtained by diverse measuring modes or come from different feature extractors. Such data are usually called multi-view data. For multi-view data, a naive method is to concatenate the multiple views directly into a single view and then use single-view algorithms for subsequent processing. However, this approach may lead to overfitting, and the unique statistical characteristics of each view cannot be fully exploited. The alternative is to use only one of the multiple views, which generally does not achieve the best performance either.

Multi-view learning (MVL) is an emerging direction that has developed rapidly in machine learning in recent years and can address the above problems well. It aims to improve generalization performance by making full use of the information from multiple views. An increasing number of MVL methods have been proposed, which can be divided into three major categories [1]: co-training style algorithms [2], [3], [4], co-regularization style algorithms [5], [6], and margin consistency style algorithms [7], [8]. Without loss of generality, we can usually obtain better performance by adopting these MVL methods. Even if only a single natural view is available, it is possible to further improve model performance by manually generating multiple views, which reflects the great advantage of MVL [2]. Recently, MVL has been a topic of intense interest in machine learning [9], [10], [11], [12], [13]. However, to the best of our knowledge, few of the existing MVL methods can directly handle sequence labeling problems. In this paper, we propose a new kind of multi-view method for labeling sequential data.

The existing sequence labeling models fall into two main categories. One is linear statistical models, such as hidden Markov models [14], maximum entropy Markov models [15], and conditional random fields (CRF) [16], [17], [18], [19]. The other is non-linear models based on neural networks. Among the linear statistical models, CRF [16] is a popular model for fitting time sequences and excels at sequential data labeling tasks. CRF makes no independence assumptions on the observations; it focuses on information at the sentence level rather than at individual positions. Therefore, CRF can more accurately capture the relationships within a sequence, and higher tagging accuracy can be expected. With these good properties, CRF has been widely used in sequence labeling tasks and achieves respectable performance [16], [17], [18], [19]. Among the non-linear neural network based models, a model based on the convolutional neural network (CNN) [20] was first presented for sequence labeling. Later, sequence labeling models based on long short-term memory (LSTM) were proposed and achieved great success [21], [22], [23].
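
For reference, a standard formulation of the linear-chain CRF that this family of models builds on (the notation here is generic and is not taken verbatim from Section 3) is

\[
p(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\exp\!\Bigl(\sum_{t=1}^{T}\psi(y_t,\mathbf{x},t) \;+\; \sum_{t=2}^{T} A_{y_{t-1},\,y_t}\Bigr),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'}\exp\!\Bigl(\sum_{t=1}^{T}\psi(y'_t,\mathbf{x},t) \;+\; \sum_{t=2}^{T} A_{y'_{t-1},\,y'_t}\Bigr),
\]

where \(\psi(y_t,\mathbf{x},t)\) is an emission score for assigning label \(y_t\) at position \(t\) and \(A\) is a matrix of label-transition scores. Because the normalizer \(Z(\mathbf{x})\) sums over whole label sequences, the model captures dependencies between adjacent labels rather than decoding each position independently.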

In this paper, we develop a hybrid neural CRF for multi-view sequence labeling, named MVCRF. The model is based on the traditional CRF and adopts diverse neural networks to extract features from the multiple views. Different from existing sequence labeling models, the proposed model not only considers the correlation between neighboring labels and jointly decodes the best label sequence, but also incorporates MVL by utilizing the consensus and complementary principles [24]. MVCRF first takes each view of the sequential data as input. Since neural networks can automatically extract features from data [25], we adopt them to extract features from the respective views. The extracted features are projected into a joint representation space. Inspired by the idea of co-regularization [26], we regularize the log-likelihood by minimizing the distance between the two views. In other words, we enforce the features from different views to be as close as possible, which reflects the consensus principle. Moreover, considering that each view may contain specific information not present in the other views, the features from the different views are taken as input to the CRF layer. Finally, the CRF layer makes a structured prediction and outputs the best label sequence.
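
To make this pipeline concrete, the following is a minimal PyTorch sketch of the idea: a Bi-LSTM encodes one view (words), a position-wise linear network encodes a second view (here POS tags, chosen only for illustration), both are projected into a joint space where a squared-distance consensus term is computed, and the concatenated projections feed a linear-chain CRF. The layer sizes, the regularization weight, and the omission of masking and Viterbi decoding are illustrative simplifications and do not reproduce the exact configuration used in our experiments.

import torch
import torch.nn as nn


class LinearChainCRF(nn.Module):
    """Negative log-likelihood of a linear-chain CRF (no padding mask, for brevity)."""

    def __init__(self, num_tags):
        super().__init__()
        self.transitions = nn.Parameter(0.01 * torch.randn(num_tags, num_tags))

    def forward(self, emissions, tags):
        # emissions: (batch, seq_len, num_tags); tags: (batch, seq_len)
        batch, seq_len, num_tags = emissions.shape
        # Score of the gold label sequence: emission terms plus transition terms.
        gold = emissions.gather(2, tags.unsqueeze(-1)).squeeze(-1).sum(dim=1)
        gold = gold + self.transitions[tags[:, :-1], tags[:, 1:]].sum(dim=1)
        # Log partition function via the forward algorithm in log space.
        alpha = emissions[:, 0]                                  # (batch, num_tags)
        for t in range(1, seq_len):
            alpha = torch.logsumexp(
                alpha.unsqueeze(2) + self.transitions.unsqueeze(0)
                + emissions[:, t].unsqueeze(1),
                dim=1,
            )
        log_z = torch.logsumexp(alpha, dim=1)
        return (log_z - gold).mean()


class MVCRF(nn.Module):
    """Two view encoders, a joint space for the consensus term, and a CRF over
    the concatenated (complementary) features."""

    def __init__(self, vocab_size, pos_size, emb_dim=100, hidden=100,
                 joint_dim=100, num_tags=23, reg_weight=0.1):
        super().__init__()
        # View 1: word embeddings encoded by a Bi-LSTM.
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True,
                              bidirectional=True)
        # View 2: POS embeddings encoded by a position-wise linear network.
        self.pos_emb = nn.Embedding(pos_size, emb_dim)
        self.linear_view = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh())
        # Projections into the joint representation space (consensus principle).
        self.proj1 = nn.Linear(hidden, joint_dim)
        self.proj2 = nn.Linear(hidden, joint_dim)
        # Emission scores from both views together (complementary principle).
        self.emission = nn.Linear(2 * joint_dim, num_tags)
        self.crf = LinearChainCRF(num_tags)
        self.reg_weight = reg_weight

    def forward(self, words, pos, tags):
        h1, _ = self.bilstm(self.word_emb(words))      # (batch, seq_len, hidden)
        h2 = self.linear_view(self.pos_emb(pos))       # (batch, seq_len, hidden)
        z1, z2 = self.proj1(h1), self.proj2(h2)        # joint representation space
        consensus = ((z1 - z2) ** 2).sum(dim=-1).mean()    # keep the views close
        emissions = self.emission(torch.cat([z1, z2], dim=-1))
        return self.crf(emissions, tags) + self.reg_weight * consensus


# Illustrative end-to-end training step on random data.
model = MVCRF(vocab_size=5000, pos_size=50)
words = torch.randint(0, 5000, (8, 20))
pos = torch.randint(0, 50, (8, 20))
tags = torch.randint(0, 23, (8, 20))
loss = model(words, pos, tags)
loss.backward()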

The main contributions of this paper can be summarized as follows. We propose a hybrid neural CRF for multi-view sequence labeling. The key idea of our model is to combine MVL with sequence labeling. We first construct a joint representation space for the different features, based on the consensus principle. Then we regularize the conditional probability distribution by the consistency of the diverse views. Meanwhile, we also follow the complementary principle to make full use of the specific information in each view. We systematically compare the performance of MVCRF with that of other models. The experimental results show that MVCRF achieves state-of-the-art performance on the CoNLL-2000 and CoNLL-2003 benchmark data sets.
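
Schematically, and in our own notation rather than the exact formulation of Section 3, the resulting training objective couples the CRF conditional log-likelihood with a consensus term over the joint-space representations of the two views:

\[
\min_{\Theta}\; -\sum_{n} \log p\bigl(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \Theta\bigr)
\;+\; \lambda \sum_{n} \sum_{t} \bigl\| z^{(n)}_{1,t} - z^{(n)}_{2,t} \bigr\|_2^{2},
\]

where \(z_{1,t}\) and \(z_{2,t}\) are the projected features of the two views at position \(t\) and \(\lambda\) controls the strength of the consensus regularizer.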

The rest of this paper is organized as follows. Section 2 reviews the related research. Section 3 presents our MVCRF model. Section 4 reports experimental results on the benchmark data sets and makes systematic comparisons. Finally, we draw conclusions and point out possible future work in Section 5.

Section snippets

Related work

For sequence labeling tasks, each label is not only related to the current input but also correlated with the previous label. That is, the predicted labels in a sequence have strong dependencies and follow specific pattern rules. For example, in the NER task with the standard IOB2 labeling scheme [27], the label “I-ORG” can follow “B-ORG” or another “I-ORG”, but it cannot follow “I-PER”. In this case, it is not appropriate to make an independence assumption. Instead of independently

Model representation

This section presents the hybrid neural CRF model MVCRF. To illustrate our model more clearly, we first briefly introduce the basic frameworks used in this paper: CRF and the Bi-LSTM for feature extraction. Then we describe the MVCRF model and present the corresponding inference and parameter optimization in detail.

Experiments

In this section, we test our MVCRF model on the CoNLL-2000 and CoNLL-2003 English data sets and report the experimental results on two sequence tagging tasks: chunking and NER.
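
The standard metric for both tasks is span-level F1 over complete chunks or entities. As a minimal illustration of how such scores can be computed, assuming the third-party seqeval package (shown for illustration only, not necessarily the tooling used for the reported results):

# Span-level (CoNLL-style) F1 over complete chunks/entities.
from seqeval.metrics import f1_score

gold = [["B-NP", "I-NP", "O", "B-VP"], ["B-PER", "I-PER", "O"]]
pred = [["B-NP", "I-NP", "O", "B-VP"], ["B-PER", "O", "O"]]

print(f1_score(gold, pred))  # only exactly matching spans count as correct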

Conclusion

In this paper, we propose a novel hybrid neural CRF model for multi-view sequence labeling. The model uses multi-view consistency to regularize the conditional likelihood and fully leverages the information from multiple views. Experimental results show that the proposed model achieves the best performance on two benchmark sequence labeling data sets.

In the proposed model, we exploit a Bi-LSTM and a linear network to extract features from the distinct views. In the future, we will further

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Project 61673179 and the Natural Science Foundation of Shanghai, PR China, under Grant No. 19ZR1415800.


References (44)

  • Sun, S., et al. Robust co-training. Int. J. Pattern Recognit. Artif. Intell. (2011)
  • Chen, N., et al. Predictive subspace learning for multi-view data: a large margin approach
  • Salzmann, M., et al. Factorized orthogonal latent spaces. J. Mach. Learn. Res. (2010)
  • Mao, L., et al. Soft margin consistency based scalable multi-view maximum entropy discrimination
  • Chen, N., et al. Large-margin predictive latent subspace learning for multiview data analysis. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • Sun, S. A survey of multi-view machine learning. Neural Comput. Appl. (2013)
  • McCallum, A., et al. Maximum entropy Markov models for information extraction and segmentation
  • Lafferty, J., McCallum, A., Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling...
  • Ratinov, L., Roth, D. Design challenges and misconceptions in named entity recognition, in: Proceedings of the...
  • Passos, A., et al. Lexicon infused phrase embeddings for named entity resolution (2014)
  • Luo, G., Huang, X., Lin, C.Y., Nie, Z. Joint Entity Recognition and Disambiguation, in: Proceedings of the 2015 Conference...
  • Collobert, R., et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. (2011)

    Xuli Sun received the B.S. degree from Shanxi University, Shanxi, China. She is currently pursuing the M.S. degree with the School of Computer Science and Technology, East China Normal University, Shanghai, China. Her current research interests include pattern recognition and machine learning.

    Shiliang Sun is a professor at the School of Computer Science and Technology and the head of the Pattern Recognition and Machine Learning Research Group, East China Normal University. He received the B.E. degree in automatic control from the Department of Automatic Control, Beijing University of Aeronautics and Astronautics in 2002, and the Ph.D. degree in pattern recognition and intelligent systems from the Department of Automation and the State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, China, in 2007. In 2004, he was named a Microsoft Fellow. From 2009 to 2010, he was a visiting researcher at the Department of Computer Science, University College London, working within the Centre for Computational Statistics and Machine Learning. From March to April 2012, he was a visiting researcher at the Department of Statistics, Rutgers University. He is a member of the PASCAL (Pattern Analysis, Statistical Modelling, and Computational Learning) network of excellence, and serves on the editorial boards of multiple international journals. His research interests include multi-view learning, approximate inference, Gaussian processes, sequential modeling, kernel methods, and their applications.

    Minzhi Yin is a professor and the Director of the Department of Pathology, Shanghai Children’s Medical Center, affiliated with Shanghai Jiao Tong University School of Medicine. She graduated from Shanghai Second Medical University and received the MD degree in 1993, and received the master’s degree from Shanghai Jiao Tong University School of Medicine in 2011. She spent one year as a clinical visiting scholar at the Royal Children’s Hospital, Melbourne, Australia, from 2004 to 2005, and completed three-month observer training periods at St. Jude Children’s Research Hospital in 2000 and at Los Angeles Children’s Hospital in 2016. Her research interests include artificial intelligence and its applications, including pathology diagnosis.

    Hao Yang is a senior researcher at the 2012 Lab of Huawei Company Limited. He became a Member (M) of IEEE in 2005 and a Senior Member (SM) in 2009. He received the Ph.D. degree from Beijing University of Posts and Telecommunications in 2009. His major research fields include natural language processing, neural machine translation, and deep learning for text.

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105151.
