Abstract
To identify short coding sequences in human genes, a new graphical representation of gene sequences, the YKW graph, is proposed, based on base bias at the three codon positions and on a classification of bases by their physicochemical properties. Nine effective area-matrix features are extracted from the YKW curves. To improve accuracy, an incremental feature selection algorithm adds four statistical features, and Principal Component Analysis (PCA) reduces the dimensionality of the resulting 13 features. A Support Vector Machine (SVM) then classifies short human sequences as coding or non-coding. Experimental results show that, compared with other methods, the proposed method achieves better recognition results with fewer features (seven or four).
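The two ingredients named in the abstract, codon-position bias and a chemical classification of bases, can be illustrated with a minimal sketch. It assumes the letters Y, K and W carry their standard IUPAC meanings (Y = pyrimidine C/T, K = keto G/T, W = weak hydrogen bonding A/T). The function name `position_class_frequencies` is hypothetical. The sketch does not reproduce the paper's YKW curves or its area-matrix features; it only shows how class membership can be tallied per codon position.

```python
from collections import Counter

# Standard IUPAC two-base classes (assumed to be what Y, K, W denote here):
# Y = pyrimidine (C, T), K = keto (G, T), W = weak hydrogen bonding (A, T)
CLASSES = {"Y": set("CT"), "K": set("GT"), "W": set("AT")}


def position_class_frequencies(seq):
    """Return {(class, codon_position): fraction of codons} for positions 1-3.

    Hypothetical helper for illustration only; not the paper's feature set.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence shorter than one codon")

    counts = Counter()
    for i in range(usable):
        pos = i % 3 + 1  # codon position: 1, 2 or 3
        for name, members in CLASSES.items():
            if seq[i] in members:
                counts[(name, pos)] += 1

    return {
        (name, pos): counts[(name, pos)] / n_codons
        for name in CLASSES
        for pos in (1, 2, 3)
    }


features = position_class_frequencies("ATGGCCTAA")
```

The result has one value per class and codon position. That this gives nine numbers is a coincidence of the 3 x 3 layout and should not be read as the nine area-matrix features reported in the abstract.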
Source
《计算机应用》
CSCD
PKU Core (Peking University Core Journals)
2011, No. 8, pp. 2087-2091 (5 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (60873184)
Natural Science Foundation of Hunan Province (07JJ5086)
Keywords
graphical representation
short coding sequence identification
area matrix
gene sequence