Focused crawler based on extended topic feature library

Cited by: 2
Abstract: To improve the efficiency and accuracy of Web page crawling in a focused crawler, an extended topic feature library (ETFL) is introduced into the crawler's Web page filtering algorithm. Each Web page is abstracted as a set of tag block nodes. A topic feature library extension algorithm expands the static feature items to generate the extended topic feature library, and a Web page topic feature item extraction algorithm extracts feature items from each page. While the crawler fetches pages, they are filtered by a Web page relevance decision method based on the extended topic feature library. The algorithm compensates for the lack of semantic-level processing in traditional Web page filtering algorithms based on static keyword items. Results from an actual project show that introducing the extended topic feature library into a focused crawler effectively improves crawling precision and offers good usability.
Source: Computer Engineering and Design (《计算机工程与设计》), PKU Core Journal, 2015, No. 5, pp. 1342-1347 (6 pages)
Funding: National Natural Science Foundation of China (61272109)
Keywords: topic feature library; Web page filtering; tag block; relevance; semantics
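
The pipeline summarized in the abstract (expand a static feature library, extract feature items from tag blocks, then filter pages by relevance) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formulas, so everything here is an assumption made for illustration: the function names, the `related_terms` similarity table, the per-tag weights, the single-word tokenizer, and the `threshold` value are all hypothetical and should not be read as the authors' method.

```python
import re
from collections import Counter

# Hypothetical per-tag weights: text in prominent tag blocks counts more.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5, "p": 1.0}


def expand_feature_library(seed_terms, related_terms):
    """Expand static seed terms into a weighted library.

    seed_terms: iterable of static keywords (weight 1.0 each).
    related_terms: assumed lookup {seed: {related_term: similarity in (0, 1]}}.
    """
    library = {term: 1.0 for term in seed_terms}
    for seed in seed_terms:
        for related, similarity in related_terms.get(seed, {}).items():
            # Keep the strongest weight if a term is reachable from several seeds.
            library[related] = max(library.get(related, 0.0), similarity)
    return library


def extract_feature_items(tag_blocks):
    """Turn a page, given as (tag, text) blocks, into weighted term counts."""
    features = Counter()
    for tag, text in tag_blocks:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for token in re.findall(r"[a-z]+", text.lower()):
            features[token] += weight
    return features


def relevance(features, library):
    """Share of the page's weighted terms that match the library."""
    total = sum(features.values())
    if total == 0:
        return 0.0
    matched = sum(w * library[t] for t, w in features.items() if t in library)
    return matched / total


def is_relevant(tag_blocks, library, threshold=0.2):
    """Filter decision: keep the page if its relevance reaches the threshold."""
    return relevance(extract_feature_items(tag_blocks), library) >= threshold


# Usage: the page never mentions the seed keyword "crawler".
page = [("title", "Web spider design"), ("p", "A spider fetches pages")]
static_library = expand_feature_library(["crawler"], {})
extended_library = expand_feature_library(["crawler"], {"crawler": {"spider": 0.8}})

static_decision = is_relevant(page, static_library)      # static keywords only
extended_decision = is_relevant(page, extended_library)  # with the expanded term
```

In this toy case the static library rejects the page while the extended library accepts it, which is the kind of semantic gap the abstract says the extended library is meant to close. A real system would need a proper HTML parser, a Chinese-capable tokenizer, and a principled source for the related terms.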
