Abstract
XML (eXtensible Markup Language) is becoming the standard format for Web data exchange. With the growing volume of semi-structured data in XML format, the processing and management of XML documents has become an active research topic. XML document clustering, an important problem in XML data processing, groups XML documents with similar features into clusters. Most existing XML document clustering methods are based on the structural features of the documents. This paper proposes a new method for clustering XML documents that combines their structural and content information. The method first extracts component vectors from the documents and converts each document into a vector representation. Then, on the basis of a document similarity computation, a hierarchical clustering method is applied to cluster the XML documents. Experiments on the DBLP XML records show that the method is feasible and that it clearly outperforms existing methods.
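The pipeline outlined in the abstract (vectorize each document, compute pairwise similarity, cluster hierarchically) can be illustrated with a minimal Python sketch. The abstract does not define the component vectors, the similarity function, or the linkage criterion, so everything specific here is an assumption made for illustration: features are counts of (element path, word) pairs, similarity is cosine similarity, and merging uses average linkage. The function names `features`, `cosine`, and `cluster` are hypothetical and are not taken from the paper.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text):
    """Count (element path, word) pairs so structure and content are combined.

    Assumption: this stands in for the paper's component vectors, which the
    abstract does not define. Only element text is used; tail text is ignored.
    """
    counts = Counter()

    def walk(node, path):
        path = path + "/" + node.tag
        for word in (node.text or "").lower().split():
            counts[(path, word)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, k):
    """Agglomerative clustering with average linkage (assumed), down to k clusters."""
    vecs = [features(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Calling `cluster(docs, k)` on a list of XML strings returns lists of document indices, one list per cluster. Because the path is part of each feature, two records that share words but place them under different elements are treated as dissimilar, which is one simple way to let structure and content both influence the result.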
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
Peking University Core Journal (北大核心)
2009, No. 5, pp. 693-699 (7 pages)
Keywords
XML, document clustering, structure, content, hierarchical clustering