Predicting protein-binding RNA nucleotides using the feature-based removal of data redundancy and the interaction propensity of nucleotide triplets

Authors
Kyungsook Han (한경숙)
Keywords
Protein-RNA interaction; Protein-binding nucleotide; Data redundancy removal; Interaction propensity
Issue Date
2013
Publisher
COMPUTERS IN BIOLOGY AND MEDICINE
Series/Report no.
COMPUTERS IN BIOLOGY AND MEDICINE ; Vol. 43, no. 11, pp. 1687-1697
Abstract
Several learning approaches have been used to predict RNA-binding amino acids in a protein sequence, but there have been few attempts to predict protein-binding nucleotides in an RNA sequence. One reason is that the differences between nucleotides in their interaction propensity are much smaller than those between amino acids. Another reason is that RNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding RNA nucleotides is much harder than predicting RNA-binding amino acids. We developed a new method that removes data redundancy in a training set of sequences based on their features. The new method constructs a larger and more informative training set than the standard redundancy removal method based on sequence similarity, and the constructed dataset is guaranteed to be redundancy-free. We computed the interaction propensity (IP) of nucleotide triplets by applying a new definition of IP to an extensive dataset of protein-RNA complexes, and developed a support vector machine (SVM) model to predict protein-binding sites in RNA sequences. In a 5-fold cross-validation with 812 RNA sequences, the SVM model predicted protein-binding nucleotides with an accuracy of 86.4%, an F-measure of 84.8%, and a Matthews correlation coefficient of 0.66. With an independent dataset of 56 RNA sequences that were not used in training, the resulting accuracy was 68.1%, with an F-measure of 71.7% and a Matthews correlation coefficient of 0.35. To the best of our knowledge, this is the first attempt to predict protein-binding RNA nucleotides in a given RNA sequence from the sequence data alone. The SVM model and datasets are freely available to academic users at http://bclab.inha.ac.kr/primer.
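For readers less familiar with the three measures reported in the abstract, the following is a minimal sketch of how accuracy, F-measure, and the Matthews correlation coefficient are conventionally computed from per-nucleotide confusion counts. It uses the standard textbook definitions and is not taken from the paper; the function name binary_metrics and the counts in the usage line are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F-measure, MCC) from binary confusion counts.

    tp/fn: binding nucleotides predicted as binding / non-binding.
    tn/fp: non-binding nucleotides predicted as non-binding / binding.
    Standard definitions; assumed, not quoted from the paper.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # MCC is defined as 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f_measure, mcc


# Hypothetical counts, not data from the paper.
acc, f1, mcc = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={acc:.3f} F-measure={f1:.3f} MCC={mcc:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient stays informative when binding and non-binding nucleotides are unevenly represented, which may be why the abstract reports it alongside the other two measures.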
URI
http://dx.doi.org/10.1016/j.compbiomed.2013.08.011
http://dspace.inha.ac.kr/handle/10505/33503
ISSN
0010-4825
Appears in Collections:
College of Engineering(공과대학) > Computer Engineering (컴퓨터공학) > Journal Papers, Reports(컴퓨터정보공학 논문, 보고서)