SEMANTIC FOCUSED CRAWLER METHOD COMBINING TEXT DENSITY
Abstract: Focused crawlers often suffer from low crawling accuracy and efficiency because their core-content extraction algorithms are inaccurate and their similarity models take insufficient account of semantic information. To address this, we propose a semantic focused crawler method that incorporates text density. A core-content extraction algorithm uses the page title together with the longest common subsequence (LCS) algorithm to locate the start and end positions of the core content text, which is then extracted. A Word2vec-based topic relevance algorithm computes the topic relevance of the core content, and an improved PageRank algorithm computes the topic importance of each link. Link priority is obtained by combining topic relevance with topic importance. In addition, to improve the crawler's global search performance, a search engine is queried with topic keywords to expand the link set. Compared with a general-purpose crawler and several existing focused crawlers, the proposed method achieves higher crawling accuracy and efficiency.
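The pipeline summarized in the abstract (locate core content with the title plus LCS, score topic relevance with Word2vec-style vectors, then combine relevance and importance into a link priority) can be sketched as below. This is a hypothetical illustration, not the authors' code: the normalization in `locate_core_start`, the combination weight `alpha`, and the use of averaged dense vectors for relevance are all assumptions, since the abstract does not give the exact formulas.

```python
# Hypothetical sketch of the crawler's scoring steps; the paper does not
# publish code, so block selection, alpha, and vectors are assumptions.
import math

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def locate_core_start(title: str, blocks: list) -> int:
    """Pick the text block most similar to the page title via LCS.
    The paper uses the title plus LCS to find where core content begins;
    normalizing by the longer string is a simple assumed heuristic."""
    scores = [lcs_length(title, b) / max(len(title), len(b), 1) for b in blocks]
    return scores.index(max(scores))

def cosine(u, v) -> float:
    """Cosine similarity between two dense vectors, e.g. averaged
    Word2vec word vectors for the core text and for the topic."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def link_priority(topic_relevance: float, topic_importance: float,
                  alpha: float = 0.5) -> float:
    """Combine Word2vec topic relevance with PageRank-style topic
    importance into one priority score; alpha is an assumed weight."""
    return alpha * topic_relevance + (1 - alpha) * topic_importance
```

For example, `link_priority(0.8, 0.4, alpha=0.6)` gives 0.64: a link whose anchor context is highly topic-relevant outranks one that is merely well linked, which matches the abstract's emphasis on semantics over pure link structure.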
Authors: Lin Zhenxian; Yuan Zhu; Li Xiaoping (School of Science, Xi'an University of Post and Telecommunications, Xi'an 710121, Shaanxi, China; School of Communication and Information Engineering, Xi'an University of Post and Telecommunications, Xi'an 710121, Shaanxi, China)
Source: Computer Applications and Software (Peking University core journal), 2019, No. 9, pp. 270-275.
Fund: Special Scientific Research Fund of the Shaanxi Provincial Department of Education (18JK0699).
Keywords: focused crawler; core content; LCS; Word2vec; link priority