Abstract
To address the tendency of traditional focused crawlers to fall into local optima and their insufficient topic description, a focused crawler method combining ontology and an Improved Tabu Search strategy (On-ITS) is proposed. First, the topic semantic vector is computed from ontology-based semantic similarity, and the Web page text feature vector is built by weighting text features according to their position in the HyperText Markup Language (HTML) page. The vector space model is then used to compute the topic relevance of each Web page. On this basis, the topic relevance of the anchor text and the PageRank (PR) value of the page the link points to are computed and combined into an overall link priority. In addition, to keep the crawler from falling into local optima, an ITS-based focused crawler is designed to optimize the crawling queue. In experiments on the topics of rainstorm disasters and typhoon disasters, run under the same experimental environment, the On-ITS focused crawler achieves a crawling precision that is at most 58% and at least 8% higher than that of the comparison algorithms, and it also performs well on the other evaluation metrics. The On-ITS focused crawler can effectively improve the accuracy of domain information acquisition and retrieve more topic-relevant Web pages.
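The relevance and link-priority steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the position weights, the linear combination in `link_priority`, the `alpha` parameter, and the example topic vector are all assumptions chosen only for illustration. The tabu search stage is not shown.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 2.0, "heading": 1.5, "body": 1.0}


def page_vector(fields, topic_terms):
    """Build a position-weighted term vector restricted to the topic terms.

    `fields` maps an HTML position name (e.g. "title") to its text.
    """
    vec = Counter()
    for position, text in fields.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in text.lower().split():
            if token in topic_terms:
                vec[token] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_priority(anchor_relevance, pagerank, alpha=0.7):
    """Assumed linear mix of anchor-text relevance and PR value."""
    return alpha * anchor_relevance + (1.0 - alpha) * pagerank


# Assumed topic vector: term -> ontology-based semantic similarity to the topic.
topic = {"rainstorm": 1.0, "flood": 0.8, "rain": 0.6}

page = page_vector(
    {"title": "rainstorm warning", "body": "heavy rain and flood"}, topic
)
relevance = cosine(page, topic)
priority = link_priority(anchor_relevance=relevance, pagerank=0.2)
```

In this sketch a term in the title counts twice as much as the same term in the body, so pages that name the topic prominently score higher; a real crawler would also need proper tokenization and a normalized PR value.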
Authors
LIU Jingfa (刘景发); GU Yaoping (顾瑶平); LIU Wenjie (刘文杰)
School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China; School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal
2020, Issue 8, pp. 2255-2261 (7 pages)
Funding
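Major Program of the National Social Science Foundation of China (16ZDA047)
Natural Science Foundation of Jiangsu Province (BK20181409)
Guangzhou Basic and Applied Basic Research Project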
Keywords
focused crawler
tabu search
ontology
topic relevance
meteorological disaster