Abstract
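This paper proposes a precise extraction algorithm for the main text of news web pages based on two-layer decisions. The two layers are a global decision on the region of the page in which the main text lies, and a local decision on whether each passage within that region really is main text. We first give a strict definition of news web page main text according to the needs of practical applications, then analyze the characteristics of news web pages and their main text, and propose a main text extraction strategy based on two-layer decisions. Both layers are modeled with feature vector extraction and a decision tree learning algorithm, and model training and testing experiments are carried out on 1687 news pages from 10 major news websites in China. The experimental results show that the two-layer method extracts the main text of news web pages precisely: the proportion of pages whose extracted text is not fully consistent with the manual annotation is only 18.14%, a relative decrease of 29.85% compared with the method using local content decisions alone, and the proportion of pages with an extraction error rate above 10% is only 7.11%, which meets the needs of practical applications.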
This paper addresses content extraction from news web pages based on two layers of decisions. The first layer predicts the scope of the content within a web page, and the second layer judges whether each paragraph within the predicted scope is content or not. We first present a strict definition of web page content oriented toward practical applications, then analyze the characteristics of news web pages and their content. Based on this analysis, we propose a two-layer content extraction method and carry out experiments on a corpus of 1867 HTML pages collected from 10 major news websites in China. The experimental results show that our method predicts the content of news web pages quite well: the percentage of web pages whose extracted content contains any mismatch is only 18.14%, a relative decrease of 29.85% compared with using the second-layer prediction alone, and only 7.11% of pages have an extraction mismatch of more than 10%, indicating that the method is suitable for practical applications.
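The two-layer idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper learns both decisions with decision trees over extracted feature vectors, whereas the sketch below substitutes hand-written threshold rules, and the features (`length`, `link_ratio`, `punct`), the thresholds, the `Block` type, and the sample page are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    link_chars: int  # hypothetical feature: characters that sit inside hyperlinks


def features(block):
    """Build a small feature vector for one text block (features are assumptions)."""
    n = len(block.text)
    link_ratio = block.link_chars / n if n else 1.0
    punct = sum(block.text.count(c) for c in ",.;!?，。；！？")
    return {"length": n, "link_ratio": link_ratio, "punct": punct}


def is_scope_anchor(f):
    # Stand-in for the learned layer-1 tree: long, link-poor blocks anchor the body region.
    return f["length"] >= 40 and f["link_ratio"] < 0.3


def predict_scope(blocks):
    """Layer 1 (global): span from the first to the last anchor block, or None."""
    hits = [i for i, b in enumerate(blocks) if is_scope_anchor(features(b))]
    if not hits:
        return None
    return hits[0], hits[-1]


def is_body(f):
    # Stand-in for the learned layer-2 tree: sentence-like, not dominated by links.
    return f["link_ratio"] < 0.5 and f["punct"] >= 1


def extract(blocks):
    """Layer 2 (local): keep only body-like blocks inside the predicted scope."""
    scope = predict_scope(blocks)
    if scope is None:
        return []
    start, end = scope
    return [b.text for b in blocks[start:end + 1] if is_body(features(b))]


# Hypothetical page, already split into text blocks.
page = [
    Block("Top story: markets rally.", 0),
    Block("Home | News | Sports | Finance", 24),
    Block("The city council approved the new transit budget on Monday, officials said.", 0),
    Block("Related: council budget history", 24),
    Block('"We are pleased," she said.', 0),
    Block("Construction is expected to begin next spring and finish within two years.", 0),
    Block("About us | Contact | Copyright notice", 30),
]

extracted = extract(page)
# Baseline for comparison: the local decision applied to every block on the page.
local_only = [b.text for b in page if is_body(features(b))]
```

On this toy page the local decision alone also keeps the ticker line at the top, because it looks like a sentence. The global scope decision excludes it, while the short quotation inside the scope is still kept by the local decision. This mirrors, in a simplified way, why the abstract reports fewer mismatched pages for the two-layer method than for the local decision alone.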
Source
《中文信息学报》
CSCD
Peking University Core Journal
2006, Issue 6, pp. 1-9, 103 (10 pages in total)
Journal of Chinese Information Processing
Funding
Supported by the National Natural Science Foundation of China (69975018)
Keywords
computer application
Chinese information processing
information extraction
feature vector
decision tree
content extraction