Abstract
Traditional methods for collecting online public opinion data about the power network suffer from low recall, low calculation accuracy, and long collection times. To address these problems, topic crawler technology is used to improve the data collection method. First, a data collection framework is built with topic crawler technology, and the topic vector of network public opinion is constructed on top of that framework. Second, the public opinion topics and keywords are defined, a similarity model is used to calculate the similarity between the keyword vector and each power-related web page, and the scored pages are added to the web crawler queue. Finally, a best-first search strategy is adopted: the web page with the highest similarity is given first priority, and the data related to network public opinion are downloaded and stored, which completes the crawl and the data collection. The experimental results show that the proposed method achieves an average recall of 92%, a web page similarity calculation accuracy above 90%, and an average data collection time of 36 min, all better than the comparison methods.
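The crawl loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name its similarity model, so cosine similarity over term-frequency vectors is assumed here, and the in-memory `TOY_WEB` dictionary, the `threshold` parameter, and all function names are hypothetical stand-ins for real page downloads and the paper's actual settings.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_first_crawl(pages, seeds, topic_keywords, threshold=0.1, max_pages=10):
    """Visit pages in descending order of similarity to the topic vector.

    pages: hypothetical in-memory web, url -> (text, list of outgoing urls).
    threshold: assumed cut-off below which a page counts as off-topic.
    """
    topic = Counter(topic_keywords)

    def score(url):
        text, _ = pages[url]
        return cosine_similarity(topic, Counter(text.lower().split()))

    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(u), u) for u in seeds if u in pages]
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    stored = []

    while frontier and len(stored) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score < threshold:
            continue  # off-topic page: neither stored nor expanded
        stored.append((url, -neg_score))
        for link in pages[url][1]:
            if link in pages and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return stored


# Toy data (invented for illustration only).
TOY_WEB = {
    "a": ("power grid outage public opinion power", ["b", "c"]),
    "b": ("football match results", ["d"]),
    "c": ("power outage complaints", ["d"]),
    "d": ("grid public opinion report", []),
}
result = best_first_crawl(
    TOY_WEB, ["a"], ["power", "grid", "outage", "public", "opinion"]
)
```

In this toy run the page `b` shares no keywords with the topic, so it is dropped, while `d` is still reached through the on-topic page `c`. A real crawler would replace the dictionary lookup with an HTTP fetch and a Chinese word segmenter, neither of which is shown here.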
Authors
奚增辉
王卫斌
陆嘉铭
瞿海妮
XI Zenghui; WANG Weibin; LU Jiaming; QU Haini (State Grid Shanghai Municipal Electric Power Company, Shanghai 200122, China)
Source
《西安工程大学学报》
CAS
2022, No. 2, pp. 72-78 (7 pages)
Journal of Xi’an Polytechnic University
Funding
Shanghai Science and Technology Research Project (GSH190983).
Keywords
web crawler
power network
internet public opinion
topic vector
data acquisition
topic index