Study on a Focused Crawler Algorithm Based on Semantic Similarity (基于语义的聚焦爬虫算法研究) Cited by: 9
Abstract Existing semantic focused crawlers have three shortcomings: they do not consider extensions of a topic's meaning, their models for computing page-to-topic similarity are flawed, and overly strict refinement of topic terms causes few results to be returned. To address these, this paper adopts the LDA (Latent Dirichlet Allocation) model to reduce the dimensionality of the documents that describe the topic terms, and improves the semantic similarity computation model. It introduces a similarity model that incorporates semantic information (SVSM) and uses SVSM to compute the similarity between a document and the topic model. Hypernyms of the topic term are obtained from an ontology and used to build a topic model of the hypernym, so that the crawler can re-acquire topic-related information based on topics present on the existing web; the resulting semantic focused crawler is called ESVSM. Comparative experiments with several crawlers on different topics show that ESVSM, which is based on topic modeling and hypernym replacement, outperforms the other algorithms in harvest rate, number of relevant web pages, and average relevance of web pages, with an average crawling precision of 85%.
Authors SUN Hong-guang (孙红光); ZANG Run-qiang (藏润强); JI Chuan-de (姬传德); YANG Feng-qin (杨凤芹); FENG Guo-zhong (冯国忠) (School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing in Jilin Province, Changchun 130117, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China)
Source Journal of Northeast Normal University (Natural Science Edition) (《东北师大学报(自然科学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2018, No. 2, pp. 51-57 (7 pages)
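Funding National Natural Science Foundation of China, Youth Fund project (11501095); Jilin Province Science and Technology Innovation Talent Cultivation Program (20170520051JH); Jilin Province Science and Technology Development Program (20170204002GX); Jilin Provincial Development and Reform Commission Guidance Project (2015Y056)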
Keywords focused crawler; LDA; topic model; vector space model (VSM); semantic similarity
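The abstract names two quantities without giving formulas: a similarity between a document and a topic that takes term semantics into account, and the harvest rate used in the evaluation. The sketch below illustrates both under stated assumptions. It uses the generic soft-cosine measure as a stand-in for a semantics-aware vector space similarity; it is not the paper's SVSM. The `TERM_SIM` table, its values, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical term-to-term semantic similarities (assumed values).
# In a real system these might be derived from an ontology or thesaurus.
TERM_SIM = {
    ("crawler", "spider"): 0.9,
    ("topic", "theme"): 0.8,
}


def term_sim(a, b):
    """Return the assumed semantic similarity of two terms in [0, 1]."""
    if a == b:
        return 1.0
    return TERM_SIM.get((a, b), TERM_SIM.get((b, a), 0.0))


def soft_dot(u, v):
    """Dot product of two {term: weight} vectors, weighted by term similarity."""
    return sum(
        wu * wv * term_sim(tu, tv)
        for tu, wu in u.items()
        for tv, wv in v.items()
    )


def semantic_similarity(doc, topic):
    """Soft-cosine similarity between a document vector and a topic vector."""
    denom = math.sqrt(soft_dot(doc, doc) * soft_dot(topic, topic))
    return soft_dot(doc, topic) / denom if denom > 0 else 0.0


def harvest_rate(scores, threshold):
    """Fraction of crawled pages whose relevance score meets the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


if __name__ == "__main__":
    doc = {"spider": 1.0}
    topic = {"crawler": 1.0}
    # A plain cosine would treat these vectors as unrelated; the soft
    # cosine credits the assumed similarity between the two terms.
    print(semantic_similarity(doc, topic))
    print(harvest_rate([0.9, 0.2, 0.7, 0.4], threshold=0.5))
```

In a plain vector space model, two vectors with no shared terms score zero even when their terms are near synonyms. Weighting the dot product by term similarity is one common way to avoid that, which is the kind of limitation the abstract attributes to earlier similarity models.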