
Hierarchical Clustering of Categorical Sequences by Similarity Normalization
Abstract A categorical sequence consists of a finite number of symbols arranged in a particular order. Such sequences, including gene sequences, protein sequences, and speech sequences, occur widely in data mining applications. As a major approach to sequence mining, sequence clustering is valuable for identifying the intrinsic structure of sequence data; it also remains an open problem because similarity between categorical sequences is difficult to measure. This paper first proposes a new similarity measure for categorical sequences that introduces a length-normalization factor to address the sensitivity of existing measures to sequence length, thereby making the similarity measure more effective. Building on this measure, it then proposes a new clustering method that constructs a connected acyclic graph from the pairwise sample similarities and performs hierarchical clustering of the categorical sequences by partitioning that graph. Experimental results on several real-world datasets show that the new method, using the normalized measure, effectively improves the clustering accuracy of categorical sequences.
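Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2015, No. 5, pp. 114-118, 141 (6 pages)
Funding Supported by the National Natural Science Foundation of China (61175123) and the Shenzhen Basic Research (Key) Project (JCYJ20120617120716224)
Keywords Categorical sequence, Clustering, Similarity, Normalization factor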
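The pipeline outlined in the abstract (a length-normalized similarity, a connected acyclic graph built from pairwise similarities, and clustering by partitioning that graph) can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas, so everything below is an assumption for illustration only: the similarity is a shared n-gram count divided by the geometric mean of the two n-gram totals, the acyclic connected graph is a maximum spanning tree, and the partition step simply cuts the weakest tree edges. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ngram_counts(seq, n=2):
    """Count the overlapping n-grams of a symbol sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def normalized_similarity(a, b, n=2):
    """Shared n-gram count divided by a length-dependent factor (assumed form)."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    shared = sum(min(ca[g], cb[g]) for g in ca.keys() & cb.keys())
    # Assumption: normalize by the geometric mean of the two n-gram totals,
    # so that long sequences do not score higher merely for being long.
    norm = sqrt(sum(ca.values()) * sum(cb.values()))
    return shared / norm if norm else 0.0


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def maximum_spanning_tree(seqs, n=2):
    """Kruskal's algorithm on the complete similarity graph; returns tree edges."""
    parent = list(range(len(seqs)))
    edges = sorted(
        ((normalized_similarity(seqs[i], seqs[j], n), i, j)
         for i, j in combinations(range(len(seqs)), 2)),
        reverse=True,
    )
    tree = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def partition(seqs, k, n=2):
    """Cut the k-1 weakest tree edges and return one cluster label per sequence."""
    tree = sorted(maximum_spanning_tree(seqs, n), reverse=True)
    parent = list(range(len(seqs)))
    for _, i, j in tree[:len(seqs) - k]:  # keep only the strongest edges
        parent[_find(parent, i)] = _find(parent, j)
    labels = {}
    return [labels.setdefault(_find(parent, i), len(labels))
            for i in range(len(seqs))]


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACG", "TTTTGGGG", "TTTGGGG"]
    print(partition(data, k=2))
```

Calling `partition` with k = 1, 2, ..., N cuts the tree edges in order of increasing similarity, so the resulting partitions are nested and together form a hierarchy. Under these assumptions the procedure behaves like single-linkage clustering; the paper's own graph construction and partition criterion may differ.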
