Abstract
A categorical sequence consists of a finite number of symbols arranged in a particular order. Such sequences, including gene sequences, protein sequences, and speech sequences, arise in many application domains of data mining. As a major approach to sequence mining, sequence clustering is valuable for identifying the intrinsic structure of sequence data; it also remains an open problem, because similarity between categorical sequences is difficult to measure. This paper first proposes a new similarity measure for categorical sequences that introduces a length-normalization factor to address the sensitivity of existing measures to sequence length, thereby improving the effectiveness of the similarity measure. Building on this measure, it then proposes a new clustering method that constructs an acyclic connected graph from the pairwise similarities between samples and performs hierarchical clustering of the categorical sequences by partitioning that graph. Experimental results on several real-world datasets show that the new method, using the normalized measure, effectively improves the clustering accuracy of categorical sequences.
Source
《计算机科学》
CSCD
Peking University Core Journal
2015, Issue 5, pp. 114-118, 141 (6 pages)
Computer Science
Funding
National Natural Science Foundation of China (61175123)
Shenzhen Basic Research (Key) Project (JCYJ20120617120716224)
Keywords
Categorical sequence
Clustering
Similarity
Normalization factor