Abstract
To correctly identify noise in web pages, such as user comments, navigation, and copyright notices, and to improve the accuracy of main-text extraction, a content extraction algorithm combining multiple text features (CETD-TPF) is proposed. Building on text block density and tag path coverage, the method additionally incorporates text symbol features and uses the resulting combined feature to identify and extract the main-text blocks. This effectively addresses the difficulty of extracting short main-text passages from web pages and requires no manual training or processing. Experiments on data sets randomly selected from well-known news websites show that CETD-TPF applies well across different data sources and that its extraction precision is better than that of the CETR and CETD algorithms.
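The abstract names three kinds of features (text block density, tag path coverage, text symbol features) but gives no formulas. The following is therefore only a minimal sketch of the general idea, not the CETD-TPF algorithm itself. It assumes that "text symbol features" can be approximated by a punctuation count, uses a simple characters-per-tag density, and replaces tag path coverage with a plain link-text penalty. All names (`Node`, `TreeBuilder`, `stats`, `score`, `extract_main_text`, `SAMPLE_HTML`) and the scoring formula are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: punctuation marks stand in for the "text symbol features".
PUNCTUATION = set(",.;:!?，。；：！？、")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag          # element name, or "#text" for a text node
        self.parent = parent
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def stats(node, in_link=False):
    """Return (chars, tags, link_chars, punctuation) for the subtree."""
    if node.tag == "#text":
        n = len(node.text)
        p = sum(ch in PUNCTUATION for ch in node.text)
        return n, 0, (n if in_link else 0), p
    in_link = in_link or node.tag == "a"
    chars = tags = link_chars = punct = 0
    for child in node.children:
        c, t, l, p = stats(child, in_link)
        chars += c
        tags += t + (child.tag != "#text")
        link_chars += l
        punct += p
    return chars, tags, link_chars, punct


def score(node):
    """Hypothetical combined score: density x non-link ratio x punctuation weight."""
    chars, tags, link_chars, punct = stats(node)
    if chars == 0:
        return 0.0
    density = chars / max(tags, 1)          # characters per descendant tag
    non_link_ratio = 1 - link_chars / chars  # penalizes navigation-like blocks
    return density * non_link_ratio * (1 + punct)


def text_of(node):
    if node.tag == "#text":
        return node.text
    return " ".join(filter(None, (text_of(c) for c in node.children)))


def elements(node):
    if node.tag != "#text":
        yield node
        for child in node.children:
            yield from elements(child)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(elements(builder.root), key=score)
    return text_of(best)


SAMPLE_HTML = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="article"><p>The city opened a new bridge today, officials said.</p><p>Traffic is expected to ease, but tolls will apply.</p></div>
<div id="footer">Copyright 2019 Example News</div>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

On the sample page, the navigation block scores zero because all of its text is link text, and the copyright line scores low because it contains no punctuation. Under these assumed weights, that is how a short article block can still outrank the surrounding noise.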
Authors
郑野
宋旭东
于林林
陈鑫影
ZHENG Ye; SONG Xudong; YU Linlin; CHEN Xinying (Software Institute, Dalian Jiaotong University, Dalian 116028, China; Digital Technology Institute, Dalian University of Science and Technology, Dalian 116052, China)
Source
《大连交通大学学报》
CAS
2019, No. 5, pp. 112-116 (5 pages)
Journal of Dalian Jiaotong University
Funding
Supported by the Natural Science Foundation of Liaoning Province (1553735707452, 20170540144)