Research of Improved Tree Path Model in Web Page Clustering

Cited by: 1


Abstract: Similarity computation is the basis of text mining and a key step in information extraction. For Web pages with complex structure, current similarity measures based on the traditional tree path model are not accurate enough. The traditional model ignores the order in which paths appear, and it compares paths by exact matching only, so it cannot describe how similar two paths are when they match only partially. Starting from Web page structural similarity, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, path position, and path weight, making up for the traditional model's inability to express document structure and hierarchy information. Experimental results show that the model improves the ability to recognize structural similarity between Web pages: it separates pages with large structural differences well, it reflects the differences between pages generated from the same template, and it yields better results in Web page clustering.
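The abstract does not give the formulas of either model, so the following is only a minimal sketch of the traditional baseline it criticizes, not the improved model proposed in the paper. It assumes that a page is reduced to a multiset of root-to-node tag paths and that two pages are compared by a Jaccard-style overlap of exactly matching paths. The names `PathCollector`, `tree_paths`, and `path_similarity`, as well as the choice of the Jaccard ratio, are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML; they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.paths = Counter()   # path string -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tree_paths(html):
    """Return the multiset of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(html_a, html_b):
    """Jaccard-style overlap of two path multisets (exact matching only)."""
    a, b = tree_paths(html_a), tree_paths(html_b)
    shared = sum((a & b).values())   # per-path minimum counts
    total = sum((a | b).values())    # per-path maximum counts
    return shared / total if total else 1.0


if __name__ == "__main__":
    first = "<div><p>1</p><ul></ul></div>"
    second = "<div><ul></ul><p>1</p></div>"
    # Sibling order differs, yet the score is the same as for identical pages.
    print(path_similarity(first, second))
```

Because the paths are stored in an unordered multiset and compared as whole strings, this baseline cannot tell the two pages in the usage example apart and gives no partial credit to paths that differ in a single tag. These are the two limitations the abstract attributes to the traditional tree path model.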

Authors: Wang Yapu (王亚普), Wang Zhijian (王志坚), Ye Feng (叶枫)

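Affiliations: College of Computer and Information, Hohai University; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics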

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 109-113 (5 pages)

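Funding: Jiangsu Water Conservancy Science and Technology Project, "Smart River" research and its application in the management of the Chuhe River in Luhe (2013025); Fundamental Research Funds for the Central Universities, Hohai University (2009B21614)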

Keywords: information extraction; Web page structure; similarity; tree path model; clustering

Classification: TP311.5 [Automation and Computer Technology: Computer Software and Theory]


