Abstract
Most news and information web pages contain content unrelated to the article body, such as recommendations and advertisements, and this noise substantially interferes with extracting the body text. The existing web page body extraction method based on text and symbol density (TSD) does not account for the influence of paragraph tags on extraction quality. To address this shortcoming, this paper proposes a web page body extraction method based on text and HTML tag density (TTD). Through statistical analysis of a page's text content and tags, the method extracts the body text quickly; it is applicable to common news and information websites and generalizes well. Experiments show that the method substantially improves extraction accuracy over currently common methods and is highly practical.
Authors
杨大为
王诗念
包立岩
要虹吏
刘畅
YANG Dawei; WANG Shinian; BAO Liyan; YAO Hongli; LIU Chang (Shenyang Ligong University, Shenyang 110159, China)
Source
《沈阳理工大学学报》
CAS
2022, No. 4, pp. 14-19 (6 pages)
Journal of Shenyang Ligong University
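Funding
Scientific Research Fund Project of the Education Department of Liaoning Province (LG201915)
Shenyang Ligong University Research and Innovation Team Building Program (SYLUTD202105).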