XML Structural Clustering Based on Cluster-Core (基于簇核心的XML结构聚类方法)

Cited by: 4


Abstract: As XML is applied ever more widely, XML structural clustering plays an important role in both the management and the mining of XML documents. Existing XML structural clustering algorithms tend to be inaccurate, inefficient, and sensitive to the order in which data is input, and they do not support incremental clustering in settings that require it. This paper addresses these problems by introducing the concept of a cluster-core and shows that incremental clustering can be supported in a dynamic environment if the cluster-cores are maintained correctly. On this basis it presents COXClustering, an XML structural clustering algorithm that covers both static and incremental clustering. In static clustering, COXClustering extracts sub-trees as features to capture the similarity between XML structures, uses the fast classification that cluster-cores allow to improve efficiency, and exploits the orthogonality of cluster-cores to reduce sensitivity to input order. In incremental clustering, it dynamically adjusts the cluster-cores according to the newly added XML documents, through both instant and batch adjustment, and thereby guides incremental clustering adaptively. Theoretical analysis and experiments on synthetic and real datasets show that static clustering is efficient, yields good clustering quality, and effectively shields the result from input order, and that incremental clustering speeds up clustering substantially while reaching a quality close to that of static clustering.
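The abstract does not spell out how COXClustering builds or maintains its cluster-cores, so the following is only a rough sketch of the general idea it describes: extract structural features from each XML document, assign the document to the most similar cluster representative, and update that representative as documents arrive. Everything here is an assumption made for illustration: parent/child tag pairs stand in for the paper's sub-tree features, Jaccard similarity stands in for its similarity measure, and the names `edge_features`, `jaccard`, `CoreClusterer`, and the `threshold` parameter are hypothetical. It does not reproduce the orthogonality property or the instant/batch adjustment that the paper claims.

```python
import xml.etree.ElementTree as ET


def edge_features(xml_text):
    """Return the set of (parent tag, child tag) pairs in a document.

    Assumption: two-node subtrees are a stand-in for the paper's
    sub-tree features, which the abstract does not define.
    """
    root = ET.fromstring(xml_text)
    feats = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            feats.add((node.tag, child.tag))
            stack.append(child)
    return feats


def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class CoreClusterer:
    """Hypothetical incremental clusterer with one 'core' per cluster."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.members = []  # per cluster: list of member feature sets
        self.cores = []    # per cluster: representative feature set

    def _rebuild_core(self, idx):
        # Core = features present in at least half of the members.
        docs = self.members[idx]
        counts = {}
        for feats in docs:
            for feat in feats:
                counts[feat] = counts.get(feat, 0) + 1
        self.cores[idx] = {f for f, c in counts.items() if 2 * c >= len(docs)}

    def add(self, xml_text):
        """Assign one document to a cluster and return the cluster index."""
        feats = edge_features(xml_text)
        best, best_sim = None, 0.0
        for i, core in enumerate(self.cores):
            sim = jaccard(feats, core)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar core: open a new cluster.
            self.members.append([feats])
            self.cores.append(set(feats))
            return len(self.cores) - 1
        self.members[best].append(feats)
        self._rebuild_core(best)
        return best
```

A document is compared only against the cores, not against every stored document, which is the kind of shortcut the abstract attributes to classification by cluster-core. Unlike the algorithm the abstract describes, this simple version remains sensitive to input order.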

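Authors: Zhang Chong (张翀), Tang Jiuyang (唐九阳), Xiao Weidong (肖卫东), Tang Daquan (汤大权)

Affiliation: Key Laboratory of Information System Engineering, National University of Defense Technology (国防科学技术大学信息系统工程重点实验室)

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2011, No. 11, pp. 2161-2176 (16 pages)

Funding: National Natural Science Foundation of China (60172012)

Keywords: XML structural clustering; cluster-core; feature association degree; sensitivity to input order; incremental clustering

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]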

