
A Clustering Method Based on Structure and Content for XML Documents (cited by: 4)
Abstract: XML (eXtensible Markup Language) is becoming the standard format for Web data exchange. As semi-structured data in XML format has become widespread, the processing and management of XML documents has become an active research topic. XML document clustering, an important problem in XML data processing, groups XML documents with similar features into clusters. Most existing XML document clustering methods rely on structural features alone. This paper proposes a new clustering method that combines the structural and content information of XML documents. First, component vectors are extracted from the documents, and each document is converted into a vector representation. Then, on the basis of a document similarity function, a hierarchical clustering algorithm is applied to cluster the XML documents. Experiments on the DBLP XML records show that the method is feasible and performs markedly better than existing methods.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2009, No. 5, pp. 693-699 (7 pages). Indexed in CSSCI and the Peking University Core Journals list.
Keywords: XML, document clustering, structure, content, hierarchical clustering
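
The abstract only outlines the pipeline (vectorize each document from its structure and content, compute pairwise similarity, then cluster hierarchically) and does not define the component vectors or the similarity function. The following is a minimal sketch of one possible reading, not the paper's algorithm. The feature definition (tag paths plus path-qualified terms), the cosine similarity, the average-link merging, the stopping `threshold`, and all function names are assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text):
    """Count structural features (tag paths) and content features
    (lower-cased text tokens qualified by the path they occur under).
    Assumption: tail text after child elements is ignored."""
    counts = Counter()

    def walk(node, parent_path):
        path = parent_path + "/" + node.tag
        counts[("path", path)] += 1
        for token in (node.text or "").lower().split():
            counts[("term", path, token)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold):
    """Average-link agglomerative clustering over document indices.
    Merging stops once the best pair of clusters is less similar
    than `threshold` (a hypothetical stopping rule)."""
    vecs = [features(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (a, b) = max(
            (link(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up sample records, loosely shaped like bibliographic entries.
SAMPLE_DOCS = [
    "<article><title>xml clustering</title><author>Li</author></article>",
    "<article><title>xml clustering methods</title><author>Wang</author></article>",
    "<book><title>data mining</title><publisher>MK</publisher></book>",
]

print(cluster(SAMPLE_DOCS, threshold=0.5))  # prints [[0, 1], [2]]
```

Qualifying each term by its tag path means that two documents are rewarded for sharing a word only when it appears in the same structural position, which is one simple way to let structure and content contribute to a single similarity score.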
  • Related literature

Co-cited literature (41)

  • 1 Pan Youneng. Research on Automatic Clustering of XML Documents[J]. Journal of the China Society for Scientific and Technical Information, 2006, 25(2): 215-220. Cited by: 16
  • 2 Kong Lingbo, Tang Shiwei, Yang Dongqing, Wang Tengjiao, Gao Jun. Query Techniques for XML Data[J]. Journal of Software, 2007, 18(6): 1400-1418. Cited by: 72
  • 3 Han J, Kamber M. Data Mining: Concepts and Techniques[M]. San Francisco, USA: Morgan Kaufmann Publishers, 2006.
  • 4 Cohn D, Hofmann T. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity[C]//Proc. of Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2001: 430-436.
  • 5 Weiss R, Velez B, Sheldon M, et al. Hypursuit: A Hierarchical Network Search Engine That Exploits Content Link Hypertext Clustering[C]//Proc. of the 7th ACM Conference on Hypertext. New York, USA: ACM Press, 1996: 180-193.
  • 6 Modha D, Spangler W. Clustering Hypertext with Applications to Web Searching[C]//Proc. of the 11th ACM Conference on Hypertext and Hypermedia. San Antonio, USA: ACM Press, 2000: 123-132.
  • 7 GB/T 7714-2005, Rules for Content, Form and Structure of Bibliographic References[S]. Beijing: Standards Press of China, 2005.
  • 8 Lee J W, Lee K, Kim W. Preparations for Semantics-Based XML Mining[C]//Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, California, USA, 2001.
  • 9 Doucet A. Naive Clustering of a Large XML Document Collection[C]//Proceedings of the 1st Annual Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), Dagstuhl, Germany, 2002.
  • 10 Lian W, Cheung D W, Mamoulis N, et al. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(1): 82-96.

Citing literature (4)

Second-level citing literature (13)
