Abstract
In the era of big data and the mobile Internet, the rapid growth of public opinion information poses challenges for its collection and analysis. Distributed computing technology facilitates the rapid collection and analysis of massive volumes of domain-specific subject public opinion. This paper studies the key technologies of subject public opinion collection and analysis, including subject public opinion collection techniques, domain dictionaries, and Chinese word segmentation, and discusses subject public opinion collection and data analysis in a distributed computing environment. Using object-oriented analysis and design methods, a distributed subject public opinion collection and analysis system based on an open source crawler is designed and implemented. With four crawler nodes performing distributed collection, the system's average collection speed improved by 2.74 times compared with the traditional single-machine collection mode.
Authors
董富江
张文学
DONG Fu-jiang; ZHANG Wen-xue (College of Science, Ningxia Medical University, Yinchuan 750004, China)
Source
《软件导刊》
2020, Issue 11, pp. 116-119 (4 pages in total)
Software Guide
Funding
Natural Science Foundation of Ningxia (2020AAC03122)
Ningxia Medical University Foundation Project (NYJY2055)
Keywords
distributed
subject public opinion
information collection
open source crawler