Abstract
News Web pages contain many paragraph tags. Most of them lie in the topic content area, while the noise areas of the page contain few. Based on this feature, a Web page purification approach is proposed that uses a local-optimum search over the tag (DOM) tree. Among sibling nodes, the container node with the most paragraph tags is kept and the other container nodes are eliminated, which yields a purified DOM tree and hence the purified Web page. Experiments show that the approach is simple to implement and purifies pages effectively, especially text-centred news pages.
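The search described in the abstract can be illustrated with a minimal sketch. It is one possible reading of the idea, not the authors' implementation: the set of container tags, the use of `p` as the paragraph tag, the function names, and the sample page are all assumptions made for illustration, and the input is assumed to be well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Assumption: which tags count as "containers" is not stated in the abstract.
CONTAINER_TAGS = {"div", "section", "article", "table", "tbody", "tr", "td"}
# Assumption: <p> is taken as the paragraph tag.
PARAGRAPH_TAG = "p"


def count_paragraphs(node):
    """Count paragraph tags in the subtree rooted at node (node included)."""
    return sum(1 for _ in node.iter(PARAGRAPH_TAG))


def purify(node):
    """Keep, at each level, only the container child with the most paragraphs.

    Non-container children (headings, paragraphs, ...) are left untouched.
    The choice is made greedily level by level (a local optimum); ties go
    to the first container in document order.
    """
    containers = [child for child in node if child.tag in CONTAINER_TAGS]
    if not containers:
        return node  # nothing left to prune below this node
    best = max(containers, key=count_paragraphs)
    for child in containers:
        if child is not best:
            node.remove(child)  # eliminate the competing sibling containers
    purify(best)  # continue the search inside the winner
    return node


# Hypothetical sample page: navigation, advert and footer are the noise.
page = ET.fromstring(
    "<body>"
    "<div id='nav'><p>Home</p></div>"
    "<div id='main'><h1>Title</h1>"
    "<div id='ad'><p>Buy now</p></div>"
    "<div id='text'><p>a</p><p>b</p><p>c</p></div>"
    "</div>"
    "<div id='footer'><p>Copyright</p></div>"
    "</body>"
)
purify(page)
print([d.get("id") for d in page.iter("div")])  # ['main', 'text']
```

Real news pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser in front of this step. The greedy choice also means that a noise block holding more paragraph tags than the article at some level would win, which is the usual trade-off of a local-optimum search.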
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2012, No. 35, pp. 9556-9561 (6 pages)
Funding
Supported by research projects of Chongqing University of Education (重庆第二师范学院), grants KY201176C and KY201175C
Keywords
Web page purification
information extraction
HTML tags
local optimum
Web page noise