
Content Extraction from News Web Pages Based on Decision Trees and Unit Distance (original title: 基于决策树与单元距离抽取新闻网页内容)
Abstract: To address text processing for news web pages, this paper proposes a method that extracts the news headline with a decision tree and identifies the body text using unit distance. Text similarity, web page tags, and tag attributes serve as the test attributes for node selection in the decision tree, and the information entropy of each attribute is computed from both headline-related and headline-unrelated factors. A decision tree is built on this basis, and the news headline is located according to the resulting rules. The nesting of web page tags is then used to narrow the search range, and the news body is located from the salient boundaries between the page's information blocks. Experimental results show that the method extracts news headlines with an accuracy above 87% and news body text with an average accuracy of 76%. The approach may also serve as a reference for text processing on other kinds of web pages.
Authors: WANG Xiao (王晓); LUO Yong-lian (罗永莲) (School of Information Technology & Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China)
Source: Journal of Jinzhong University (《晋中学院学报》), 2019, No. 3, pp. 66-71 (6 pages)
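Funding: Shanxi Provincial Education Science "13th Five-Year Plan" project, "Research on Teaching Models for Big Data-Related Majors Based on the Concept of Innovation and Entrepreneurship Education" (GH-18091); Jinzhong University Teaching Reform and Innovation Project, "A Case Study on Integrating Innovation and Entrepreneurship Education into Data Science and Big Data Technology Professional Education" (Jg201807)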
Keywords: information gain; decision tree; news web page; content extraction; web page visual block
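
The abstract states that decision tree nodes are chosen by testing attributes such as text similarity and web page tags, scored through information entropy. The sketch below illustrates only the standard ID3-style information gain on a toy set of headline candidates. It is not the authors' implementation: the record does not give their entropy formula, so the field names (`tag`, `similar_to_page_title`, `is_title`), the sample rows, and the functions `entropy` and `information_gain` are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, label_key="is_title"):
    """Standard ID3 information gain of splitting `rows` on `attribute`."""
    base = entropy([row[label_key] for row in rows])
    # Group the class labels by the value the attribute takes in each row.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label_key])
    # Weighted entropy that remains after the split.
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return base - remainder


# Hypothetical headline candidates; the values are invented for this example.
candidates = [
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "p", "similar_to_page_title": False, "is_title": False},
    {"tag": "p", "similar_to_page_title": True, "is_title": False},
    {"tag": "div", "similar_to_page_title": False, "is_title": False},
    {"tag": "div", "similar_to_page_title": False, "is_title": False},
]

gains = {
    attr: information_gain(candidates, attr)
    for attr in ("tag", "similar_to_page_title")
}
# The attribute with the largest gain would be tested first in the tree.
best_attribute = max(gains, key=gains.get)
print(best_attribute, gains)
```

On this toy data the `tag` attribute separates headlines from non-headlines perfectly, so it has the larger gain and would be tested at the root. Real pages would need many more candidates and attributes before such a ranking means anything.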