Abstract
The core promoter is a key DNA sequence upstream of the transcription start site (TSS) that can initiate transcription but is not itself transcribed. Promoter prediction has been studied extensively, but its accuracy still needs to be improved. The support vector machine (SVM) is a machine learning method used mainly for classification, and it has particular advantages for small-sample, nonlinear, and high-dimensional pattern recognition problems. In this study, an SVM was applied to core promoter identification with two feature extraction schemes: K-mer word frequency statistics and orthogonal coding of the nucleic acid sequence. A large set of promoter and non-promoter sequences obtained from public databases was used, and 10-fold cross-validation was carried out. The results showed that orthogonal coding predicted promoters more accurately than K-mer word frequency statistics, and that among the four SVM kernel functions the RBF kernel gave the highest prediction accuracy. Prediction accuracy was also similar across sequences of different lengths (with the start position ranging from -249 to -100 relative to the TSS). This suggests that the difference in sequence pattern between promoters and non-promoters lies mainly in the interval from 100 bases upstream (-100) to 50 bases downstream (+50) of the TSS. In promoter sequences, G and C occurred significantly more often than A and T, whereas the opposite held for non-promoter sequences.
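The two feature extraction schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the value of K, the base ordering, or the handling of ambiguous bases, so `k=3`, the `A, C, G, T` order, and the treatment of `N` below are assumptions made for illustration; orthogonal coding is taken here to mean the usual one-hot representation with four positions per base. The function names are hypothetical and are not the authors' implementation.

```python
from itertools import product

BASES = "ACGT"  # assumed base order; the abstract does not specify one


def kmer_frequencies(seq, k=3):
    """Return normalized counts of every k-mer over ACGT, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing ambiguous bases (e.g. N) are skipped
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def orthogonal_encoding(seq):
    """Encode each base as four 0/1 values (one-hot); unknown bases become all zeros."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of the sequence."""
    seq = seq.upper()
    known = sum(seq.count(b) for b in BASES)
    return (seq.count("G") + seq.count("C")) / known if known else 0.0
```

For a window from -100 to +50 around the TSS (150 bases, if position 0 is not counted), the one-hot vector has 600 entries, whereas the K-mer vector has 4^K entries regardless of sequence length. Either vector could then be passed to an SVM classifier with an RBF kernel and evaluated by 10-fold cross-validation, as the abstract describes.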
Source
《基因组学与应用生物学》
CAS
CSCD
PKU Core (北大核心)
2016, Issue 7, pp. 1675-1680 (6 pages)
Genomics and Applied Biology
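Funding
Supported by the National Natural Science Foundation of China project "Informatics-based identification of gene regulatory sequences and determination of the regulatory sequences of several tumor-related genes" (60601017)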
Keywords
Core promoter
Support vector machine
Identification