
An XML Document Clustering Method Based on a Structural Information Summary Tree (cited by: 1)

Clustering XML Documents Based on a Structural Summary Tree
Abstract: An effective method for representing the structural information of XML documents is proposed. The structural information of each document is encoded with a digitized structural summary tree (SST), a structural distance is defined on this encoding, and a genetic algorithm is used to cluster the documents. Experiments show that the method achieves high classification accuracy, is easy to implement, and requires no prior DTD knowledge. An approach for calculating the structural similarity between XML documents is proposed in this paper. The structural information of an XML document is captured with a structural summary tree (SST). By encoding elements as numbers, an SST is transformed into a digit-labeled tree. After normalization, the numbers at the different tree levels are concatenated to form a vector, so each XML document is represented as an m-dimensional vector. A GA-based clustering algorithm is adopted because it can provide good results irrespective of the starting configuration. Experimental results show the effectiveness and scalability of the approach.
Source: Journal of Applied Sciences (《应用科学学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2005, No. 1, pp. 71-74 (4 pages)
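Keywords: XML documents; structural information; clustering method; DTD; genetic algorithm; encoding; accuracy; experimental verification; distance; representation method; XML; information retrieval; document clustering; GA; SST (structure summary tree)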
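
The abstract describes turning each XML document into an m-dimensional vector, with one normalized number per tree level, and measuring a structural distance between documents. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' implementation. The abstract does not give the summarization rule, the element numbering, the normalization, or the distance formula, so the choices here are illustrative guesses: tags are collected as a set per level (which collapses repeated elements but discards parent-child links), the codebook is alphabetical, each level is normalized as the mean tag code divided by the vocabulary size, and the distance is Euclidean. The function names `summary_levels`, `build_codebook`, `to_vector`, and `structural_distance` are hypothetical. The GA-based clustering step is not shown.

```python
import math
import xml.etree.ElementTree as ET


def summary_levels(xml_text, max_depth):
    """Return one set of distinct tag names per tree level (root = level 0).

    Assumption: a set per level stands in for the structural summary tree.
    It merges repeated elements but drops parent-child links.
    """
    levels = [set() for _ in range(max_depth)]

    def visit(node, depth):
        if depth >= max_depth:
            return
        levels[depth].add(node.tag)
        for child in node:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_text), 0)
    return levels


def build_codebook(level_sets_per_doc):
    """Assign each tag a positive integer code (assumed: alphabetical order)."""
    tags = sorted({t for levels in level_sets_per_doc for s in levels for t in s})
    return {tag: i + 1 for i, tag in enumerate(tags)}


def to_vector(levels, codebook):
    """One number in [0, 1] per level: mean tag code / vocabulary size.

    Assumption: this normalization is illustrative only; empty levels map to 0.
    """
    n = len(codebook)
    return [
        sum(codebook[t] for t in tags) / (len(tags) * n) if tags else 0.0
        for tags in levels
    ]


def structural_distance(u, v):
    """Euclidean distance between two document vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy documents (hypothetical): the first two share a structure, the third differs.
docs = [
    "<play><act><scene/><scene/></act></play>",
    "<play><act><scene/></act><act><scene/></act></play>",
    "<article><title/><author/></article>",
]
M = 3  # vector dimension m = number of tree levels considered

level_sets = [summary_levels(d, M) for d in docs]
codebook = build_codebook(level_sets)
vectors = [to_vector(ls, codebook) for ls in level_sets]

d_same = structural_distance(vectors[0], vectors[1])
d_diff = structural_distance(vectors[0], vectors[2])
```

With these toy inputs, the two `play` documents map to the same vector because repeated `act` and `scene` elements collapse, while the `article` document lies at a nonzero distance from both. A clustering algorithm such as the GA-based one named in the abstract would then operate on `vectors`.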










