Abstract
Mainstream open-source crawler frameworks typically assess the relevance of a page to a topic domain by combining keyword-based quantification with the vector space model. This combination, however, overlooks the relationship between page semantics and the specific topic, so the crawled content drifts away from the topic. To provide accurate data support for public opinion analysis in finance and other fields, this paper proposes a domain-oriented crawler and system built on an extended topic library. By expanding the topic feature library and integrating the Vector Space Model (VSM) with the Hyperlink-Induced Topic Search (HITS) algorithm, it optimizes the calculation of topic-page relevance. Simulation experiments on crawling stock public opinion information show that the extended topic crawler improves crawling accuracy and efficiency and can effectively complete the task of acquiring domain topic information.
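The abstract does not specify how the VSM and HITS scores are fused, so the following is only a minimal sketch under stated assumptions: VSM relevance is taken as the cosine similarity between a page's term-frequency vector and a weighted topic feature vector, HITS is the standard power iteration, and the two are combined by a weighted sum. The function names, the `alpha` weight, and the sample link graph are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(page_terms, topic_weights):
    """VSM relevance: cosine between a page's term-frequency vector and the topic vector."""
    tf = Counter(page_terms)
    dot = sum(tf[t] * w for t, w in topic_weights.items())
    norm_page = math.sqrt(sum(c * c for c in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def hits(links, iterations=50):
    """Standard HITS iteration; `links` maps each page to the pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub of p: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


def combined_relevance(page, page_terms, topic_weights, auth, alpha=0.7):
    """Assumed fusion rule (not from the paper): weighted sum of VSM and HITS authority."""
    vsm = cosine_similarity(page_terms, topic_weights)
    return alpha * vsm + (1 - alpha) * auth.get(page, 0.0)


# Hypothetical topic feature vector and link graph for illustration only.
topic = {"stock": 2.0, "market": 1.0}
graph = {"a": ["c"], "b": ["c"], "c": []}
hub_scores, auth_scores = hits(graph)
score_c = combined_relevance("c", ["stock", "stock", "market"], topic, auth_scores, alpha=0.5)
```

In a crawler, a score like `score_c` could be compared against a threshold to decide whether a page's outgoing links are queued; the threshold and the value of `alpha` would have to be tuned on real data.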
Authors
陶飞飞
徐佳
徐松阳
唐明伟
TAO Fei-fei; XU Jia; XU Song-yang; TANG Ming-wei (School of Computer and Information, Hohai University, Nanjing, Jiangsu 210098, China; School of Computer, Nanjing Audit University, Nanjing, Jiangsu 211815, China)
Source
《计算机仿真》
2024, No. 10, pp. 222-226 (5 pages)
Computer Simulation
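Funding
National Natural Science Foundation of China (42001250)
National Key Research and Development Program of China (2018YFC1508100)
Major Project of Philosophy and Social Science Research in Jiangsu Universities (2021SJZDA153)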
Keywords
Extended topic crawler
Vector space model
Hyperlink-induced topic search
Stock public opinion information