Abstract
To address the difficulty of defining an efficient representation model and a meaningful distance measure for clustering symbolic sequences, this paper proposes a representation model based on probability vectors and a clustering algorithm built on that model. The model represents each symbolic sequence by a probability distribution and, on this basis, defines a distance measure between sequences computed from the difference between their probability distributions, together with a clustering objective function for the probability vector space model. A K-means-type algorithm for symbolic sequence clustering is then given and evaluated experimentally on real-world sequence sets from various domains. The experimental results show that, compared with symbolic sequence clustering algorithms that use a subsequence-based representation model, the proposed method effectively improves clustering accuracy while reducing clustering time by more than 50% on real data with a relatively large number of symbols, such as DNA sequences and speech sequences.
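The abstract does not state the exact representation, distance measure, or objective function, so the following Python sketch only illustrates the general idea under explicit assumptions. It assumes each sequence is represented by the relative frequencies of its symbols, that the distance is the squared Euclidean distance between those vectors, and that cluster centers are initialized by a deterministic farthest-point rule. All function names (`probability_vector`, `sq_distance`, `kmeans_sequences`) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def probability_vector(seq, alphabet):
    """Relative frequency of each alphabet symbol in seq.

    Assumption: a zero-order symbol distribution; the paper's actual
    probability-vector model may differ. seq must be non-empty.
    """
    counts = Counter(seq)
    n = len(seq)
    return [counts[s] / n for s in alphabet]


def sq_distance(p, q):
    """Squared Euclidean distance between two probability vectors (assumed measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sequences(seqs, k, n_iter=100):
    """K-means-type clustering of symbolic sequences via probability vectors.

    Requires 1 <= k <= len(seqs). Returns (labels, centers).
    """
    alphabet = sorted(set().union(*seqs))
    vectors = [probability_vector(s, alphabet) for s in seqs]

    # Assumed initialization: start from the first vector, then repeatedly
    # add the vector farthest from all centers chosen so far.
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(sq_distance(v, c) for c in centers))
        centers.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each sequence joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_distance(v, centers[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: each center becomes the mean of its members' vectors.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

For example, calling `kmeans_sequences(["AAAAAT", "AATAAA", "GGGGGC", "GGCGGG"], k=2)` should place the two A-rich sequences in one cluster and the two G-rich sequences in the other. Because each sequence is reduced to a fixed-length vector once, every iteration costs time proportional to the number of sequences times the alphabet size, which is consistent with the abstract's point that avoiding a subsequence-based representation can reduce clustering time.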
Authors
Cheng Lingfang (程铃钫); Chen Lifei (陈黎飞)
Affiliations: Jinshan College of Fujian Agriculture & Forestry University, Fuzhou 350002, China; School of Mathematics & Computer Science, Fujian Normal University, Fuzhou 350117, China
Source
Application Research of Computers (《计算机应用研究》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2018, No. 6, pp. 1676-1680 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61672157)