
Maximum Similarity Matching Algorithm for Noise Reduction in Web Pages Based on LCS (基于LCS的特征树最大相似性匹配网页去噪算法)

Cited by: 3
Abstract: This paper proposes an LCS-based algorithm that removes noise from Web pages by maximum similarity matching of characteristic trees. The target page and a similar page are each parsed into a characteristic tree, and each tree is mapped to a sequence of characteristic nodes. Because the LCS (longest common subsequence) algorithm yields a globally optimal longest common subsequence, it is used to find the nodes that differ between the two characteristic trees; these nodes form a candidate set. The candidates are then clustered and scored to locate the important content blocks of the page. A prototype system of the algorithm is presented, and the implementation of each module is described in detail.
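Source: Video Engineering (《电视技术》), PKU Core Journal, 2011, No. 13, pp. 44-48, 63 (6 pages)
Funding: National "863" Program of China (2008BAH28B04); Shanghai Science and Technology Commission project (08dz1500108); China Postdoctoral Science Foundation and Shanghai Postdoctoral Fund projects (20090460637, 10R21414800)
Keywords: LCS; characteristic tree; noise reduction in Web pages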
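
The pipeline outlined in the abstract (tree to node sequence, LCS matching, candidate set, clustering and scoring) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the path-plus-text node feature, the grouping of candidates by parent node, and the text-length score are all assumptions made for illustration, since the abstract does not specify them. The `page` helper only builds a toy template for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    """Hypothetical characteristic-tree node: a tag, its own text, its children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def flatten(root: Node) -> List[Tuple[str, str, str]]:
    """Preorder traversal; each entry is (node path, text, parent path)."""
    out: List[Tuple[str, str, str]] = []

    def visit(node: Node, parent_path: str, index: int) -> None:
        path = f"{parent_path}/{node.tag}[{index}]"
        out.append((path, node.text, parent_path))
        for i, child in enumerate(node.children):
            visit(child, path, i)

    visit(root, "", 0)
    return out


def lcs_matched(a: list, b: list) -> Set[int]:
    """Indices of `a` that belong to one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    matched: Set[int] = set()
    i = j = 0
    while i < n and j < m:  # walk the table to recover one optimal alignment
        if a[i] == b[j]:
            matched.add(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def main_block(target: Node, similar: Node) -> Optional[str]:
    """Path of the parent whose unmatched children carry the most text (assumed score)."""
    t, s = flatten(target), flatten(similar)
    matched = lcs_matched([(p, txt) for p, txt, _ in t],
                          [(p, txt) for p, txt, _ in s])
    scores: Dict[str, int] = {}
    for i, (_, text, parent) in enumerate(t):
        if i not in matched and text:  # candidate: node absent from the similar page
            scores[parent] = scores.get(parent, 0) + len(text)
    return max(scores, key=scores.get) if scores else None


def page(title: str, body: str) -> Node:
    """Toy page: shared navigation and footer around a page-specific article."""
    return Node("html", children=[Node("body", children=[
        Node("div", children=[Node("a", "Home"), Node("a", "About")]),
        Node("div", children=[Node("h1", title), Node("p", body)]),
        Node("div", children=[Node("span", "Copyright 2011")]),
    ])])


if __name__ == "__main__":
    print(main_block(page("Title A", "Body text of article A"),
                     page("Title B", "Another body, for article B")))
```

In this toy example the navigation and footer nodes appear in both pages and are absorbed by the LCS, so only the heading and paragraph of the middle `div` remain as candidates. Two identical pages leave no candidates, and the function returns `None`.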


