Abstract
Existing semantics-based focused crawlers suffer from several limitations: they do not consider extensions of the topic's meaning, their models for computing the similarity between page content and the topic are flawed, and overly strict refinement of topic terms returns too few results. To address these problems, this paper adopts the LDA (Latent Dirichlet Allocation) model to reduce the dimensionality of the documents describing the topic terms and to improve the semantic similarity model, proposing a similarity model that incorporates semantic information (SVSM); SVSM is used to compute the similarity between a document and the topic model. Hypernyms of the topic terms are obtained from an ontology, a topic model of the hypernyms is constructed, and the crawler uses it to retrieve semantically related pages from the web as topic-related resources, yielding a semantic focused crawler (ESVSM). Comparative experiments with several crawlers on multiple topics show that the proposed ESVSM algorithm, based on topic modeling and hypernym substitution, outperforms the other algorithms in harvest rate, number of relevant web pages, and average page relevance, achieving an average crawling precision of 85%.
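The core decision step the abstract describes, scoring a fetched page against a topic model and following its links only if the score clears a threshold, can be sketched as below. This is a minimal illustration, not the paper's SVSM: the topic distributions, the `RELEVANCE_THRESHOLD` value, and the `is_relevant` helper are all assumptions for demonstration; in practice the vectors would come from an LDA implementation (e.g. a topic-modeling library) applied to the topic-term description documents.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical values: a 3-topic LDA distribution for the crawl topic
# and for one fetched page (each would normally be inferred by LDA).
topic_model = [0.60, 0.30, 0.10]
page_distribution = [0.55, 0.35, 0.10]

RELEVANCE_THRESHOLD = 0.8  # assumed cutoff for queueing a page's out-links

def is_relevant(page_vec, topic_vec, threshold=RELEVANCE_THRESHOLD):
    """Crawler decision: keep the page and follow its links if similar enough."""
    return cosine_similarity(page_vec, topic_vec) >= threshold

print(is_relevant(page_distribution, topic_model))  # True for these vectors
```

When too few relevant pages are returned (the "overly strict refinement" problem the abstract mentions), the same scoring would be repeated against a second topic model built from the topic terms' hypernyms, broadening the crawl without abandoning the original topic.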
Authors
SUN Hong-guang
ZANG Run-qiang
JI Chuan-de
YANG Feng-qin
FENG Guo-zhong
SUN Hong-guang; ZANG Run-qiang; JI Chuan-de; YANG Feng-qin; FENG Guo-zhong (School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing in Jilin Province, Changchun 130117, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China)
Source
《东北师大学报(自然科学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2018, No. 2, pp. 51-57 (7 pages)
Journal of Northeast Normal University(Natural Science Edition)
Funding
National Natural Science Foundation of China Youth Fund (11501095)
Science and Technology Innovation Talent Cultivation Program of Jilin Province (20170520051JH)
Science and Technology Development Plan Project of Jilin Province (20170204002GX)
Guiding Project of the Jilin Province Development and Reform Commission (2015Y056)