An XML Document Clustering Method Considering Layer Information

Clustering XML documents by layer information


Abstract: A layer-sensitive clustering method for XML document collections, CXLI, is proposed. First, the concept of a structural table is introduced to eliminate repeated and nested structures in XML documents. Second, constraints on the basic editing operations of XML documents are established that take layer information into account. Third, a layer-aware similarity measure between XML documents is given. Finally, the XML document collection is clustered using agglomerative hierarchical clustering. Experiments on the ACM SIGMOD data set and a synthetic data set show that CXLI achieves better precision at essentially the same computation time.
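The pipeline named in the abstract (structure summary, layer-aware similarity, agglomerative clustering) can be illustrated with a minimal sketch. This record gives only the abstract, so everything concrete below is an assumption rather than the CXLI definitions: `structure_entries` is a hypothetical stand-in for the structural table that only drops repeated structures, `layer_similarity` swaps the paper's constrained edit operations for a depth-weighted Jaccard measure with an assumed `decay` parameter, and average linkage is an assumed choice of merge criterion.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def structure_entries(xml_text):
    """Collect the distinct (depth, root-to-node tag path) pairs of a document.

    Hypothetical stand-in for a structural table: storing entries in a set
    drops repeated structures. Recursive nesting is not collapsed here.
    """
    entries = set()

    def walk(node, depth, path):
        path = path + (node.tag,)
        entries.add((depth, path))
        for child in node:
            walk(child, depth + 1, path)

    walk(ET.fromstring(xml_text), 0, ())
    return entries


def layer_similarity(a, b, decay=0.5):
    """Weighted Jaccard similarity; an entry at depth d weighs decay ** d.

    Assumption: differences near the root matter more than deep ones.
    """
    def weight(entries):
        return sum(decay ** depth for depth, _ in entries)

    union = weight(a | b)
    return weight(a & b) / union if union else 1.0


def agglomerative(docs, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns clusters as lists of indices into docs.
    """
    tables = [structure_entries(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        sims = [layer_similarity(tables[i], tables[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters[j]
        del clusters[j]  # safe because combinations always yields i < j
    return clusters


docs = [
    "<a><b/><b/><c><d/></c></a>",
    "<a><b/><c><d/></c></a>",
    "<x><y/></x>",
    "<x><y/><y/><z/></x>",
]
clusters = agglomerative(docs, 2)
```

In this toy collection the first two documents differ only by a repeated `<b/>` element, so they share the same entry set and merge first. Recomputing every linkage on each pass keeps the sketch short but is only suitable for small collections.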

Authors: Liu Zhaojun, Zhao Haoyu, Wang Jing, Li Xiongfei, Li Wei

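Affiliations: Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education; College of Computer Science and Technology, Jilin University; College of Software, Jilin University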

Source: Journal of Jilin University (Engineering and Technology Edition), 2014, No. 1, pp. 124-128 (5 pages). Indexed in EI, CAS, CSCD, and the PKU Core Journals list.

Funding: Jilin Province Science and Technology Development Plan Project (20090704); Natural Science Foundation of Jilin Province (201115020)

Keywords: artificial intelligence; data mining; extensible markup language (XML); similarity measure; clustering; layer

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]



