Abstract
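To reduce or eliminate the interference caused by the large amount of off-topic information on news websites, a news Web page extraction method is proposed. It uses entropy-based computation together with knowledge of the DOM tree to extract the main article and its related links from news Web pages.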
This paper proposes an approach for extracting news articles from Web pages that applies information theory to the DOM tree. Experiments on several news Web sites show that the approach is practical.
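The abstract does not spell out how entropy is computed, so the following is only a minimal sketch of one common way to apply entropy to page blocks, not the authors' method. It assumes that each page has already been split into text blocks (for example, one string per DOM subtree) and that several pages from the same site are available. The function names `term_entropies`, `block_entropy`, and `informative_blocks`, the whitespace tokenization, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def term_entropies(pages):
    """Normalized entropy (0..1) of each term's distribution across pages.

    `pages` is a list of pages; each page is a list of block strings.
    This representation is an assumption, not taken from the paper.
    """
    n_pages = len(pages)
    if n_pages < 2:
        raise ValueError("need at least two pages to compare")
    counts = defaultdict(Counter)  # term -> {page index: occurrences}
    for i, blocks in enumerate(pages):
        for block in blocks:
            for term in block.lower().split():
                counts[term][i] += 1
    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page.values())
        h = -sum((c / total) * math.log(c / total) for c in per_page.values())
        # Divide by log(n_pages) so a term spread evenly over all pages scores 1.
        entropies[term] = h / math.log(n_pages)
    return entropies


def block_entropy(block, entropies):
    """Mean term entropy of a block; unseen terms count as page-specific (0)."""
    terms = block.lower().split()
    if not terms:
        return 1.0  # treat an empty block as uninformative
    return sum(entropies.get(t, 0.0) for t in terms) / len(terms)


def informative_blocks(pages, page_index, threshold=0.5):
    """Blocks of one page whose entropy is below the (hypothetical) threshold."""
    entropies = term_entropies(pages)
    return [b for b in pages[page_index]
            if block_entropy(b, entropies) < threshold]
```

Under these assumptions, a navigation bar repeated on every page scores close to 1 and is discarded, while text that appears on only one page scores close to 0 and is kept as a candidate for the main article.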
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2007, No. 4, pp. 48-51 (4 pages)
New Technology of Library and Information Service
Keywords
Entropy
Information extraction
Informative block
DOM