Hierarchical Clustering of Categorical Sequences by Similarity Normalization

Abstract: A categorical sequence consists of a finite number of symbols arranged in a certain order. Such sequences are widespread in many application domains of data mining, for example gene sequences, protein sequences, and speech sequences. As a major method of sequence mining, sequence clustering is valuable for identifying the intrinsic structure of sequence data; at the same time, it remains an open problem because similarity between categorical sequences is difficult to measure. This paper first proposes a new similarity measure for categorical sequences, introducing a length-normalization factor to address the sensitivity of existing measures to sequence length and thereby improve the effectiveness of the similarity measure. On this basis, a new clustering method is proposed: a connected acyclic graph is constructed from the similarities between samples, and hierarchical clustering of the categorical sequences is performed by graph partitioning. Experimental results on several real-world datasets show that the new method, using the normalized measure, effectively improves the clustering accuracy of categorical sequences.
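The pipeline outlined in the abstract (a length-normalized similarity, an acyclic connected graph built from it, and clustering by graph partitioning) can be sketched roughly as follows. The abstract does not give the paper's actual formulas, so everything here is an assumption made for illustration: the similarity is a shared bigram count divided by the geometric mean of the two sequences' bigram totals, the acyclic connected graph is taken to be a maximum spanning tree, and partitioning simply cuts the weakest tree edges (which amounts to single-linkage clustering). The function names `normalized_similarity`, `maximum_spanning_tree`, and `partition` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngram_counts(seq, n=2):
    """Count the contiguous n-grams of a string."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def normalized_similarity(a, b, n=2):
    """Shared n-gram count divided by a length-based factor (assumed form)."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    shared = sum((ca & cb).values())  # multiset intersection
    # Assumed normalization: geometric mean of the two n-gram totals.
    norm = (sum(ca.values()) * sum(cb.values())) ** 0.5
    return shared / norm if norm else 0.0


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def maximum_spanning_tree(seqs, n=2):
    """Kruskal's algorithm over all pairs, most similar pairs first."""
    edges = sorted(
        ((normalized_similarity(seqs[i], seqs[j], n), i, j)
         for i, j in combinations(range(len(seqs)), 2)),
        reverse=True,
    )
    parent = list(range(len(seqs)))
    tree = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def partition(seqs, k, n=2):
    """Cut the k-1 weakest tree edges and return the resulting components."""
    tree = sorted(maximum_spanning_tree(seqs, n), reverse=True)
    parent = list(range(len(seqs)))
    for _, i, j in tree[:max(0, len(seqs) - k)]:
        parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(len(seqs)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    return sorted(groups.values())


seqs = ["ABABABAB", "ABABAB", "CDCDCDCD", "CDCDCD"]
clusters = partition(seqs, k=2)
```

Calling `partition` with increasing `k` cuts progressively stronger edges of the same tree, so the resulting partitions are nested, which is what makes the procedure hierarchical. Dividing by a length-based factor keeps a long sequence from scoring high against everything merely because it contains more n-grams.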

Authors: Zhang Hao (张豪), Chen Lifei (陈黎飞), Guo Gongde (郭躬德)

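Affiliations: School of Mathematics and Computer Science, Fujian Normal University; Fujian Provincial Key Laboratory of Network Security and Cryptology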

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 114-118, 141 (6 pages)

Funding: National Natural Science Foundation of China (61175123); Shenzhen Basic Research (Key) Project (JCYJ20120617120716224)

Keywords: Categorical sequence, Clustering, Similarity, Normalization factor

Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
