Abstract
Existing sequence similarity measures consider only the local similarity of subsequences and ignore the global structural information of the sequences to which they belong. To address this, a sequence similarity measure based on the entropy of individual symbols is proposed, in which the entropy of a symbol is computed from the positions and number of occurrences of that symbol within a sequence. The measure is validated through agglomerative hierarchical clustering. Experimental results on symbolic sequence datasets from multiple domains show that, compared with existing methods based on local subsequence similarity, the proposed measure effectively improves the quality of the clustering results.
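The abstract does not state the entropy formula, so the following is only a minimal sketch of the general idea, not the paper's method. It makes three assumptions: a symbol's entropy is taken as the Shannon entropy of the gaps between its consecutive occurrences (including the sequence boundaries), two sequences are compared with a min/max ratio over their per-symbol entropies, and clusters are merged by average linkage. The names `symbol_entropies`, `entropy_similarity`, and `agglomerative` are hypothetical.

```python
import math
from collections import defaultdict


def symbol_entropies(seq):
    """Return {symbol: entropy} from occurrence positions (assumed definition)."""
    n = len(seq)
    positions = defaultdict(list)
    for i, sym in enumerate(seq, start=1):  # 1-based positions
        positions[sym].append(i)
    entropies = {}
    for sym, pos in positions.items():
        # Gaps between consecutive occurrences, padded with boundaries 0 and n + 1.
        # They sum to n + 1, so gap / (n + 1) is a probability distribution.
        bounds = [0] + pos + [n + 1]
        gaps = [b - a for a, b in zip(bounds, bounds[1:])]
        entropies[sym] = -sum((g / (n + 1)) * math.log(g / (n + 1)) for g in gaps)
    return entropies


def entropy_similarity(seq_a, seq_b):
    """Min/max ratio over per-symbol entropies (assumed measure), in [0, 1]."""
    ha, hb = symbol_entropies(seq_a), symbol_entropies(seq_b)
    symbols = set(ha) | set(hb)
    # A symbol absent from a sequence contributes an entropy of 0.
    num = sum(min(ha.get(s, 0.0), hb.get(s, 0.0)) for s in symbols)
    den = sum(max(ha.get(s, 0.0), hb.get(s, 0.0)) for s in symbols)
    return num / den if den > 0 else 1.0


def agglomerative(seqs, k):
    """Naive average-linkage agglomerative clustering into k clusters of indices."""
    dist = [[1.0 - entropy_similarity(a, b) for b in seqs] for a in seqs]
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        pairs = ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    data = ["AAAAAAAB", "AAAAAABA", "BBBBBBBA", "BBBBBABB"]
    print(agglomerative(data, 2))
```

Because each entropy depends on every occurrence of a symbol across the whole sequence, the comparison reflects global structure rather than matching local subsequences. The quadratic pair search is kept for readability and would need replacing for large datasets.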
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core (北大核心)
2016, No. 5, pp. 201-206, 212 (7 pages)
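Funding
General Program of the National Natural Science Foundation of China, "Research on Event Sequence Mining Methods for Software Behavior Identification" (61175123)
Innovation Team Fund of Fujian Normal University (IRTL1207)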
Keywords
symbol sequence
similarity
entropy
hierarchical clustering
sequence clustering