Abstract
Existing tree clustering algorithms cannot adapt to dynamic changes and uncertainty in data. This paper studies the clustering of uncertain data and proposes a dynamic clustering algorithm for uncertain tree databases, which addresses the failure to cluster caused by dynamic changes in the data. First, the concepts of transformed tree set, similar group, and tree class set are introduced to describe a clustering model for an uncertain tree database. Second, to measure the similarity between subtrees more accurately, and considering that subtrees have both node semantic features and structural properties, a semantic similarity measure and a structural similarity measure are proposed; the two are weighted in a given proportion and summed to obtain the final similarity. Third, a dynamic clustering process is designed in which the clustering threshold is obtained adaptively, which substantially reduces the inaccuracy introduced by manual intervention. Subtrees with similar structure are gathered into the same similar group, while the similarity between subtrees in different groups is minimized. For each similar group, a formula is defined to extract a representative subtree, which serves as a tree class, and the tree classes together form the tree class set. Finally, experiments on both simulated data and a real environment show that the algorithm is effective and feasible, that its clustering results are fairly accurate, and that it runs efficiently.
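The weighted combination of the two similarity measures described in the abstract can be written schematically as follows. This is only a sketch of the general form: the symbols \(t_1, t_2\) (two subtrees), \(\alpha\) and \(\beta\) (the weights), and the constraint \(\alpha + \beta = 1\) are assumptions introduced here for illustration, and the paper's actual definitions of the semantic and structural terms are not given in this record.

```latex
\[
\mathrm{Sim}(t_1, t_2)
  = \alpha \,\mathrm{Sim}_{\mathrm{sem}}(t_1, t_2)
  + \beta \,\mathrm{Sim}_{\mathrm{str}}(t_1, t_2),
\qquad \alpha, \beta \ge 0,\ \alpha + \beta = 1 \ \text{(assumed)}
\]
```

Under these assumptions, if both component measures take values in \([0, 1]\), the combined similarity also stays in \([0, 1]\), which makes it directly comparable against a clustering threshold.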
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2013, No. 6, pp. 1339-1343 (5 pages)
Journal of Chinese Computer Systems
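Funding
Supported by the Scientific Research Project of the Hunan Provincial Education Department (12CD291, 11C1051)
Supported by the University-level Scientific Research Program of Jishou University (11JD051)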
Keywords
data mining
ordered tree
frequent subtree
similarity
uncertain tree
clustering