Public opinion data collection of power network using topic crawler (cited by: 6)
Abstract: Traditional methods for collecting public opinion data on power networks suffer from low recall, low calculation accuracy, and long run times. To address these problems, topic crawler technology is used to improve the data collection method. First, a data collection framework is built with topic crawler technology, and a topic vector for network public opinion is constructed on top of that framework. Second, the topics and keywords of network public opinion are defined, a similarity model is used to calculate the similarity between the keyword vector and power-related web pages, and the pages are added to the web crawler queue. Finally, a best-first search strategy is adopted: the web page with the highest similarity is given first priority, and the data related to network public opinion are downloaded and stored, which completes the crawl and the data collection. Experimental results show that the proposed method reaches an average recall of 92%, a web page similarity calculation accuracy above 90%, and an average data collection time of 36 min, all better than the comparison methods.
Authors: XI Zenghui; WANG Weibin; LU Jiaming; QU Haini (State Grid Shanghai Municipal Electric Power Company, Shanghai 200122, China)
Source: Journal of Xi’an Polytechnic University (CAS), 2022, No. 2, pp. 72-78 (7 pages)
Funding: Shanghai Science and Technology Research Project (GSH190983).
Keywords: web crawler; power network; internet public opinion; topic vector; data acquisition; topic index
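
The pipeline outlined in the abstract (score each page against a topic keyword vector, queue it, then visit the best-scoring page first) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the similarity model, so cosine similarity over term-frequency vectors is an assumption here, and `best_first_crawl`, `toy_fetch`, `TOY_WEB`, the `threshold` value, and the keyword list are all hypothetical names and data chosen for illustration. The "web" is an in-memory dictionary so the example runs without network access.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_first_crawl(seeds, fetch, topic_keywords, threshold=0.1, max_pages=100):
    """Store pages in descending order of similarity to the topic.

    fetch(url) is a hypothetical callable returning (text, outlinks).
    Pages scoring below `threshold` (an assumed cutoff) are discarded
    and their outlinks are not followed.
    """
    topic = term_vector(" ".join(topic_keywords))
    frontier = []  # min-heap of (-score, url, outlinks)
    seen = set()

    def enqueue(url):
        if url in seen:
            return
        seen.add(url)
        text, outlinks = fetch(url)
        score = cosine_similarity(topic, term_vector(text))
        if score >= threshold:
            # heapq is a min-heap, so negate the score to pop the best page first.
            heapq.heappush(frontier, (-score, url, outlinks))

    for url in seeds:
        enqueue(url)

    stored = []
    while frontier and len(stored) < max_pages:
        neg_score, url, outlinks = heapq.heappop(frontier)
        stored.append((url, -neg_score))  # stands in for "download and store"
        for link in outlinks:
            enqueue(link)
    return stored


# Toy in-memory web: url -> (page text, outgoing links). Purely illustrative.
TOY_WEB = {
    "seed": ("power grid outage news portal", ["a", "b", "c"]),
    "a": ("power outage complaint power grid electricity price", ["d"]),
    "b": ("football match results and scores", ["d"]),
    "c": ("electricity price discussion", []),
    "d": ("power grid outage outage complaint", []),
}


def toy_fetch(url):
    return TOY_WEB[url]


result = best_first_crawl(
    ["seed"], toy_fetch, ["power", "grid", "outage", "electricity", "price"]
)
for url, score in result:
    print(f"{url}\t{score:.3f}")
```

In this toy run the off-topic page `b` scores zero and is never stored, while page `d`, discovered later than `c`, is still visited before it because its similarity is higher. A real crawler would also need politeness delays, robots.txt handling, and Chinese word segmentation, none of which are covered here.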