Abstract
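To address the shortcomings of traditional methods for extracting the main text of Chinese web pages, a statistics-based extraction method is proposed. The method first uses the DOM tree to compute the text density of each text node, defined as the ratio of text length to HTML source length. It then applies the Bayesian decision criterion to compute a density threshold. Finally, it extracts the main text by comparing each node's text density with that threshold: nodes whose density exceeds the threshold are classified as main-text nodes, and nodes whose density is less than or equal to the threshold are classified as non-main-text nodes. Concatenating the text of all main-text nodes yields the extracted main text. The method was validated on Chinese news web pages; the results show that, although simple, it achieves very high extraction accuracy and is easy to implement.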
In view of the shortcomings of traditional methods, this paper proposes a statistical method for extracting the main text from Chinese web pages. The method is simple, yet accurate and easy to implement. It extracts the main text based on the text density of each text node, which is computed from the DOM tree as the ratio of text length to HTML source length. The main text is selected by comparing each node's text density with a fixed threshold, and that threshold is obtained using the Bayesian decision criterion. Experimental results show that the proposed method is an effective solution for extracting the main text from Chinese web pages, especially Chinese news pages.
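The density-and-threshold step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `DensityParser` and `extract_body` are hypothetical, density is approximated per element as the length of its direct text divided by the length of its source (tags plus content), and the threshold is passed in by hand rather than derived with the Bayesian criterion. The sketch also assumes well-formed HTML in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Assumption: a small set of void tags that never receive a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: script and style content is never main text.
SKIPPED_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Collects (text, density) for every element that directly contains text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open elements: tag, direct text pieces, source length
        self.blocks = []  # finished (text, density) pairs in closing order

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text() or ""
        if tag in VOID_TAGS:
            # Void tags only add markup length to the enclosing element.
            if self.stack:
                self.stack[-1]["source_len"] += len(raw)
            return
        self.stack.append({"tag": tag, "text": [], "source_len": len(raw)})

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1]["tag"] != tag:
            return  # mismatched close tags are ignored in this sketch
        node = self.stack.pop()
        node["source_len"] += len(tag) + 3  # length of "</" + tag + ">"
        text = "".join(node["text"]).strip()
        if text and tag not in SKIPPED_TAGS:
            # Text density: text length over (approximate) HTML source length.
            self.blocks.append((text, len(text) / node["source_len"]))
        if self.stack:
            # The parent's source length includes the whole child element.
            self.stack[-1]["source_len"] += node["source_len"]

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"].append(data)
            self.stack[-1]["source_len"] += len(data)


def extract_body(html, threshold):
    """Join the text of all elements whose density is strictly above threshold."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return "\n".join(text for text, density in parser.blocks if density > threshold)


if __name__ == "__main__":
    sample = (
        "<div><p>这是一段比较长的新闻正文内容,用来说明文本密度较高的结点。</p>"
        '<ul><li><a href="/a/very/long/navigation/link.html">首页</a></li></ul></div>'
    )
    # 0.5 is an arbitrary illustrative threshold, not a value from the paper.
    print(extract_body(sample, 0.5))
```

In this sample the paragraph consists almost entirely of text, so its density is high, while the navigation link is dominated by markup and falls below the threshold. In the method described above the threshold would come from the Bayesian decision criterion rather than being chosen by hand.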
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2009, No. 2, pp. 187-194 (8 pages)
Journal of the China Society for Scientific and Technical Information