Abstract
Automatic extraction of news body text from web pages is an important research problem in information extraction. Existing methods based on line-block distribution perform poorly on news articles made up of short paragraphs. To address this, an improved algorithm for automatically extracting news body text from static Chinese web pages is proposed. The method adopts a better line-block segmentation strategy to construct the line-block distribution function, and uses the longest common subsequence as the decision criterion for quickly locating the start and end line blocks of the news body. The algorithm was evaluated on 1,000 news web pages: the average extraction accuracy was 95.0%, the average recall was 96.54%, the average body-text loss rate was 1.6%, and the average time to extract a single page was 0.13 s. The results show that the new algorithm is suitable for large-scale automatic extraction of news body text from web pages.
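The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the segmentation strategy nor which strings are compared by the longest common subsequence. The function names `lcs_length` and `block_lengths`, the window size `block_size`, and the choice to count non-whitespace characters are all assumptions made for illustration.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)  # DP row for the prefix of `a` seen so far
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def block_lengths(lines: List[str], block_size: int = 3) -> List[int]:
    """Hypothetical line-block distribution.

    For each window of `block_size` consecutive lines, return the total
    number of non-whitespace characters. The window size and the counting
    rule are assumptions, not taken from the paper.
    """
    stripped = ["".join(line.split()) for line in lines]
    n_blocks = max(len(stripped) - block_size + 1, 1)
    return [
        sum(len(s) for s in stripped[i:i + block_size])
        for i in range(n_blocks)
    ]
```

In a pipeline of this kind, a distribution such as the one from `block_lengths` would typically be scanned for a dense region of text, and a score such as `lcs_length` could then help decide where that region starts and ends. How the paper combines the two is not stated in the abstract.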
Authors
何春辉
王孟然
HE Chunhui; WANG Mengran (Engineering Training Center, Xiangtan University, Xiangtan 411105, China)
Source
《东莞理工学院学报》
2018, No. 5, pp. 46-50 (5 pages)
Journal of Dongguan University of Technology
Keywords
block distribution
automatic extraction
rapid positioning
longest common subsequence