Abstract
This paper proposes a DOM-tree-based content extraction method. The method refines the framework of DOM-based content extraction via text density. Based on observations of classical Chinese translation websites, it replaces the text density of the original method with punctuation density. On a test set of 50 randomly selected classical Chinese translation web pages, the proposed method achieves better precision, recall, and F-measure.
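The abstract does not give the exact formula, so the following is only a minimal sketch of what a punctuation density (标点符号密度) score per DOM node might look like. It assumes, by analogy with text density, that the score is the number of punctuation characters in a node's subtree divided by the number of tags beneath it (with zero tags treated as one). The punctuation set and the names `Node`, `TreeBuilder`, `parse`, and `punctuation_density` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Assumed punctuation set (full-width and ASCII); the paper's set is not stated.
PUNCTUATION = set(",。;:!?、,.;:!?")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}  # their text is not page content


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def punct_count(node):
    if node.tag in SKIP_TAGS:
        return 0
    own = sum(ch in PUNCTUATION for ch in node.text)
    return own + sum(punct_count(child) for child in node.children)


def tag_count(node):
    return sum(1 + tag_count(child) for child in node.children)


def punctuation_density(node):
    # Assumed definition: punctuation characters per descendant tag.
    return punct_count(node) / max(tag_count(node), 1)


page = ('<div><a>首页</a><a>古文</a><a>诗词</a></div>'
        '<div><p>学而时习之,不亦说乎?有朋自远方来,不亦乐乎?</p></div>')
nav, main = parse(page).children
print(punctuation_density(nav))   # 0.0: link-heavy block, no punctuation
print(punctuation_density(main))  # 4.0: four punctuation marks, one tag
```

In this toy page the navigation block scores zero while the block holding the classical text scores highest, which is the kind of separation such a density measure is meant to provide. How the scores are smoothed or thresholded to pick the final content node is left out, since the abstract does not describe it.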
Source
《智能计算机与应用》
2015, No. 4, pp. 42-44, 47 (4 pages)
Intelligent Computer and Applications