Abstract
Clustering algorithms are widely used in bioinformatics data analysis and are among the main techniques for analyzing gene sequences and expression data. This paper proposes a clustering algorithm for gene sequence analysis based on the vector space model. First, the structural characteristics of DNA sequences are used to assemble multiple DNA sequences into a sequence set. The vector space model is then used to compute the pairwise similarity matrix of the sequences in the set, and a suitable threshold is chosen to take a cut set of the similarity matrix, which yields the final clustering result. Simulation experiments on DNA sequence data show that the algorithm is practical and effective for gene sequence analysis, and that it offers a concise procedure, accurate semantics, and a controllable vector dimension.
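The pipeline outlined in the abstract (vectorize each sequence, compute pairwise similarities, cut the similarity matrix at a threshold) can be illustrated with a minimal sketch. The abstract does not state how sequences are vectorized, which similarity measure is used, or how the cut is turned into clusters, so the choices here are assumptions made for illustration only: k-mer counts as vector components (giving a dimension of 4**k, controlled by k), cosine similarity, and grouping of sequences linked by above-threshold similarities. All function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k=2):
    """Map a DNA sequence to a vector of k-mer counts (assumed feature choice).

    The vector has 4**k components, so k controls the dimension.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def cosine(u, v):
    """Cosine similarity of two count vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(seqs, k=2):
    """Pairwise similarity matrix of a set of sequences."""
    vecs = [kmer_vector(s, k) for s in seqs]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def threshold_clusters(sim, threshold):
    """Cut the matrix at `threshold` and label the resulting groups.

    Sequences i and j are linked when sim[i][j] >= threshold; each cluster
    is one connected group of linked sequences (assumed interpretation of
    the cut step).
    """
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy sequences, made up for this example (not data from the paper).
seqs = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
sim = similarity_matrix(seqs, k=2)
labels = threshold_clusters(sim, threshold=0.8)
print(labels)
```

Raising the threshold splits the set into more, tighter clusters, while lowering it merges them; in this sketch the threshold is a free parameter that would have to be tuned on real data.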
Source
《微计算机信息》
2010, No. 16, pp. 155-157 (3 pages)
Control & Automation
Keywords
gene sequences
vector space model
cluster analysis