Abstract
With the rapid development of the Internet, XML has become the most commonly used language for data exchange and storage on the Web, and extracting valuable information from large collections of XML documents is an active research topic. This paper proposes an improved similarity computation method based on the SET/BAG model. The method converts each node of an XML document into an object consisting of an object name, a parent object, an attribute set, and the weight of the object relative to its parent. This representation captures the structural information of an XML document fairly completely, and the method adjusts the weights of repeated nodes to reduce their influence on the similarity computation. Experiments on both real and synthetic data sets show that the proposed method yields good clustering results.
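The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the children of a node share one unit of weight equally per distinct tag, that repeated siblings with the same tag split that tag's share, and that two documents are compared with a weighted Jaccard ratio over their aggregated objects. The names `NodeObject`, `to_objects`, `profile`, and `similarity` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeObject:
    name: str              # element tag
    parent: str            # parent tag, "" for the root
    attributes: frozenset  # attribute names carried by the element
    weight: float          # weight relative to the parent (assumed scheme)


def to_objects(xml_text):
    """Return (NodeObject, cumulative weight) pairs, one per element."""
    root = ET.fromstring(xml_text)
    result = [(NodeObject(root.tag, "", frozenset(root.attrib), 1.0), 1.0)]
    stack = [(root, 1.0)]
    while stack:
        parent, parent_total = stack.pop()
        children = list(parent)
        counts = Counter(child.tag for child in children)
        for child in children:
            # Assumption: each distinct tag gets an equal share of the parent,
            # and repeated siblings split that share among themselves.
            weight = 1.0 / (len(counts) * counts[child.tag])
            total = parent_total * weight
            obj = NodeObject(child.tag, parent.tag, frozenset(child.attrib), weight)
            result.append((obj, total))
            stack.append((child, total))
    return result


def profile(xml_text):
    """Sum cumulative weights per (name, parent, attributes) key."""
    totals = Counter()
    for obj, total in to_objects(xml_text):
        totals[(obj.name, obj.parent, obj.attributes)] += total
    return totals


def similarity(xml_a, xml_b):
    """Weighted Jaccard ratio between two document profiles (an assumed measure)."""
    a, b = profile(xml_a), profile(xml_b)
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return shared / union if union else 1.0


two_books = '<lib><book id="1"><title/></book><book id="2"><title/></book></lib>'
one_book = '<lib><book id="1"><title/></book></lib>'
one_cd = '<lib><cd/></lib>'

print(similarity(two_books, one_book))  # 1.0: the repeated <book> is damped
print(similarity(two_books, one_cd))    # 0.25: only the root object is shared
```

Under this particular weighting, repeated siblings jointly count as a single node, which is the strongest possible damping. A milder scheme, somewhere between the set-like and bag-like views, would change only the line that computes `weight`.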
Source
Journal of Natural Science of Hunan Normal University (《湖南师范大学自然科学学报》)
CAS
PKU Core Journal (北大核心)
2015, No. 5, pp. 91-94 (4 pages)
Keywords
XML
document clustering
similarity computation