期刊文献+

基于序列前缀技术的XML频繁路径挖掘算法

Prefix-Based XML Frequent Path Mining Algorithm
下载PDF
导出
摘要 XML文档是半结构化数据,对其进行频繁路径挖掘可以分为两步:XML文档序列化和序列挖掘阶段.现有的序列化方式将XML文档表示为Xpath路径集合,其中有大量的节点冗余;序列挖掘阶段采用的类Apriori算法需要多次扫描数据库并产生大量的候选集,采用的Prefix Span算法会产生大量的投影数据库,占用较大的内存.针对以往XML频繁路径挖掘算法存在的不足,本文提出一种高效的挖掘算法——基于序列前缀技术的XML频繁路径挖掘算法(PXFP,Prefix-based XML Frequent Path Mining Algorithm).PXFP算法以广度优先方式遍历XML文档树并将每个节点表示为"节点:父节点"的形式,这种序列化的方式减少了节点冗余.在序列挖掘阶段借鉴Prefix Span算法中前缀的概念,但不产生投影数据库,仅得到直接后缀(即前缀的子节点),通过记录频繁子路径的位置信息逐渐扩大频繁模式的长度,位置信息的引入减少了对数据库的扫描.实验结果表明,PXFP算法取得了比Prefix Span算法更高的时间和空间效率. XML documents are semi-structured data, and XML frequent path mining can be divided into two steps: XML document serialization and sequence mining. The existing serialization method expresses the XML document as a set of Xpath paths with a plenty of node redundancy. Algorithms based on Apriori require multiple scanning of the database and can generate a large number of candidate sets. The PrefixSpan algorithm generates a large number of projection databases, occupying a lot of memory space. In view of the shortcomings of the existing algorithms used in XML frequent path mining, this paper proposes an efficient mining algorithm called Prefix-based XML Frequent Path Mining Algorithm (PXFP). The PXFP algorithm traverses the XML document tree in a breadth-first manner and represents each node as "node: parent node", which reduces the node redundancy. The PXFP does not generate the projection database, but only gets the sub-node of the prefix, and then increases the length of the frequent pattern by the position information of the frequent sub-path, which reduces scanning the database. The experimental results show that the PXFP algorithm achieves higher time and space efficiency than the PrefixSpan algorithm.
作者 张洁 毛国君
出处 《计算机系统应用》 2018年第1期78-85,共8页 Computer Systems & Applications
基金 国家自然科学基金(61273293)
关键词 XML频繁路径挖掘 序列化 位置信息 前缀 XML frequent path mining serialization location information prefix
  • 相关文献

参考文献7

二级参考文献83

共引文献50

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部