Abstract
In graph theory, the tree edit distance (TED) between two directed, labeled, rooted trees is an actively studied problem. As a combinatorial optimization problem, computing TED is widely used to measure the structural similarity of semi-structured documents. This paper proposes the concept of a tree semantic edit distance (TSED) between such trees and gives a formula for computing it. TED and TSED are then combined into a single distance measure, which is applied to the structural clustering of the document object model (DOM) trees of extensible markup language (XML) documents. Experimental results show that the proposed measure clearly outperforms clustering based on TED alone in both precision and recall. The time complexity of the proposed algorithm matches that of the best dynamic-programming algorithms for computing TED.
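To make the TED notion in the abstract concrete, the following is a minimal sketch of a tree edit distance computed by memoized recursion over forests. It is not the paper's algorithm: it assumes ordered trees, unit costs for insert, delete, and relabel, and a hypothetical tuple encoding `(label, children)` built by an illustrative helper `node`. TSED is not shown, because its formula is not given in the abstract.

```python
from functools import lru_cache

# Assumed encoding (not from the paper): a tree is (label, children),
# where children is a tuple of trees; a forest is a tuple of trees.
def node(label, *children):
    return (label, tuple(children))

def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)   # insert every node of g
    if not g:
        return forest_size(f)   # delete every node of f
    (v_label, v_children) = f[-1]   # rightmost root of f
    (w_label, w_children) = g[-1]   # rightmost root of g
    # Delete v: its children take its place in the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert w: symmetric to deletion.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map v to w: solve the remaining forests and the two subforests,
    # paying 1 only when the labels differ.
    relabel = (forest_distance(f[:-1], g[:-1])
               + forest_distance(v_children, w_children)
               + int(v_label != w_label))
    return min(delete, insert, relabel)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Two small DOM-like trees that differ in one element label.
doc_a = node("book", node("title"), node("author"))
doc_b = node("book", node("title"), node("editor"))
print(tree_edit_distance(doc_a, doc_b))  # 1
```

The memoized recursion is chosen here for readability only; it suits small trees and does not reflect the complexity of the dynamic-programming algorithms mentioned in the abstract.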
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Indexed in: EI, CSCD, PKU Core (北大核心)
2011, No. 6, pp. 816-824 (9 pages)
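Funding
National Natural Science Foundation of China (No. 60970047)
China Postdoctoral Science Foundation (No. 20100471503)
Natural Science Foundation of Shandong Province (No. Y2008G19)
Shandong Province Science and Technology Research Project (No. 2007GG10001002, 2008GG10001026)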
Keywords
Tree Edit Distance, Document Clustering, Structural Similarity, Semantic Similarity