ABSTRACT
Machine Learning (ML)-based classification of protein characteristics from primary sequences is an important tool for exploring candidate proteins in targeted drug discovery, mutational analysis, and functional identification. However, ML feature selection requires extensive manual curation and knowledge of the chemistry, interactions, and micro-environment of the proteins of interest. Current approaches include amino acid composition strategies, specific motif analysis, and Quantitative Structure-Activity Relationship (QSAR)-based feature generation methods. In contrast, we propose an automated, generalized feature generation method based on Natural Language Processing (NLP), using a modified combination of N-Gram and Skip-Gram models (m-NGSG). Optimal parameters are selected using an adapted grid search algorithm, enabling high-throughput, global application of our approach. A meta-comparison of logistic regression-mediated classification approaches exploiting m-NGSG with other published models shows enhanced accuracy for functional and structural classification, both binary and multi-class, in every instance. Because it does not depend on detailed physicochemical knowledge, the m-NGSG approach is ideal for exploring protein characteristics recalcitrant to previous approaches, without any loss in predictive accuracy. A further test of the prediction quality of m-NGSG on cationic channel blockers from arthropods with 70% sequence identity demonstrated 94.10% and 92.30% accuracy on the training and test sets, respectively. The latter study demonstrates the applicability of the m-NGSG model to the functional classification of proteins using a novel dataset. Thus, without the requirement of expert intervention for optimal feature selection, it is hoped that this automated feature generation approach will reduce the time needed to employ ML classification strategies for the prediction of protein characteristics.
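To make the idea of NLP-style feature generation concrete, the following is a minimal sketch of plain n-gram and skip-gram counting over a primary sequence. It is not the authors' m-NGSG method: the abstract does not describe the modifications, so the function names, the `max_n` and `max_skip` parameters, the `_` gap notation, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count contiguous n-grams (runs of n consecutive residues)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def skipgram_counts(seq, k):
    """Count residue pairs separated by exactly k skipped positions.

    The skipped positions are written as underscores, e.g. "M_V" for k=1.
    This notation is an illustrative assumption, not the paper's.
    """
    gap = k + 1
    return Counter(
        f"{seq[i]}{'_' * k}{seq[i + gap]}" for i in range(len(seq) - gap)
    )


def sequence_features(seq, max_n=2, max_skip=1):
    """Merge n-gram counts (n = 1..max_n) and skip-gram counts (k = 1..max_skip).

    max_n and max_skip are hypothetical tunable parameters of the kind a
    grid search could explore; they are not taken from the paper.
    """
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngram_counts(seq, n))
    for k in range(1, max_skip + 1):
        features.update(skipgram_counts(seq, k))
    return features


# Hypothetical toy sequence, not drawn from any dataset in the paper.
features = sequence_features("MKVLAAGK")
```

Each sequence becomes a sparse bag of counts that could be fed to a classifier such as logistic regression. In a setup like this, a grid search would evaluate candidate values of parameters such as `max_n` and `max_skip` and keep the combination with the best validation accuracy.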
Index Terms
- Protein Classification using Modified N-Gram and Skip-Gram Models: Extended Abstract