Abstract
Computing the similarity between XML documents is a difficult problem in XML document classification. This paper first proposes a model for computing XML document similarity and then uses XMLGenerator to run simulation tests. The method is structure-based: it applies sequential pattern mining to find the maximal common paths between two XML document trees, and measures similarity as the ratio of the number of nodes on the maximal common paths to the number of nodes on all paths extracted from the document trees. The paper also proposes a new approach to minimizing XML documents and considers both the semantic similarity and the structural similarity of document nodes, which further improves the accuracy of the computed similarity. Experiments indicate that the method has good application prospects.
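The ratio described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that a "path" is a root-to-leaf sequence of tag names, and it substitutes a plain longest-common-subsequence match for the sequential pattern mining step. It also leaves out the document minimization and the semantic similarity of node names. The function names (`root_to_leaf_paths`, `lcs_length`, `structural_similarity`) and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf path as a tuple of tag names."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.append(current)  # a leaf closes one path
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def structural_similarity(xml_a, xml_b):
    """Ratio of best-matched path nodes to all path nodes (assumed definition)."""
    paths_a = root_to_leaf_paths(xml_a)
    paths_b = root_to_leaf_paths(xml_b)
    # For each path, count the nodes it shares with its best match in the other document.
    matched = sum(max(lcs_length(p, q) for q in paths_b) for p in paths_a)
    matched += sum(max(lcs_length(q, p) for p in paths_a) for q in paths_b)
    total = sum(len(p) for p in paths_a) + sum(len(q) for q in paths_b)
    return matched / total


# Hypothetical sample documents, used only for illustration.
doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><publisher/></book>"
print(structural_similarity(doc_a, doc_b))
```

Under these assumptions, two identical documents score 1.0, and the score decreases as fewer path nodes can be matched between the two trees.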
Source
Computer Simulation (《计算机仿真》)
CSCD
2005, Issue 12, pp. 300-302, 310 (4 pages)
Keywords
Extensible markup language (XML)
Information retrieval
Data mining
Sequential pattern mining