
Research of Improved Tree Path Model in Web Page Clustering
Abstract Similarity computation is the basis of text mining and a key step in information extraction. For Web pages with complex structure, current similarity measures based on the traditional tree path model are not accurate enough. The traditional model ignores the order in which paths appear, and it compares paths by exact matching, so it cannot describe how similar two paths are when they match only partially. Starting from the structural similarity of Web pages, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, path position, and path weight, which compensates for the traditional model's inability to express document structure and hierarchy. Experimental results show that the model improves the ability to recognize structural similarity between Web pages: it separates pages with large structural differences well, it reflects the differences between pages generated from the same template, and it yields better results in Web page clustering.
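Source Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 109-113 (5 pages)
Funding Supported by the Jiangsu Water Conservancy Science and Technology Project "Research on 'Smart River' and Its Application in the Management of the Chuhe River in Luhe" (2013025) and the Fundamental Research Funds for the Central Universities, Hohai University (2009B21614)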
Keywords Information extraction, Web page structure, Similarity, Tree path model, Clustering
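
The abstract contrasts exact path matching, which ignores path order, with a measure that also credits partial matches. The paper's actual formulas are not given in this record, so the following Python sketch is only an illustration under stated assumptions: `exact_similarity` is a plain set-overlap (Jaccard) baseline in the spirit of a bag-of-paths model, and `path_similarity` / `soft_similarity` use a shared-prefix ratio chosen here for illustration. All function names are hypothetical, and none of this reproduces the improved model's sibling, position, or weight terms.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records the root-to-node tag path of every element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def exact_similarity(paths_a, paths_b):
    """Baseline (assumed): Jaccard overlap of path sets, exact matches only."""
    set_a, set_b = set(paths_a), set(paths_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def path_similarity(path_a, path_b):
    """Illustrative partial match: shared leading tags over the longer length."""
    a, b = path_a.split("/"), path_b.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def soft_similarity(paths_a, paths_b):
    """Average best partial match per path, symmetrized over both pages."""
    if not paths_a or not paths_b:
        return 1.0 if not paths_a and not paths_b else 0.0

    def one_way(src, dst):
        return sum(max(path_similarity(p, q) for q in dst) for p in src) / len(src)

    return (one_way(paths_a, paths_b) + one_way(paths_b, paths_a)) / 2


page_a = "<html><body><div><p>a</p><p>b</p></div></body></html>"
page_b = "<html><body><div><p>a</p><span>b</span></div></body></html>"
page_c = "<html><body><div><span>b</span><p>a</p></div></body></html>"

paths_a, paths_b, paths_c = tag_paths(page_a), tag_paths(page_b), tag_paths(page_c)
```

On these toy pages the exact baseline scores `page_a` against `page_b` at 0.8, while the prefix-based measure gives roughly 0.975 because `html/body/div/span` still shares three of four tags with `html/body/div/p`. The baseline also scores `page_b` and `page_c` as identical even though their sibling order differs, which is the order-blindness the abstract describes.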

