Abstract
[Purpose/significance] This paper studies microblog corpora and related data-processing techniques in order to design a microblog theme corpus. [Method/process] The paper selects 500 "big V" users and 500 grassroots users, collects the first 300 posts published by each user as the research object, preprocesses and filters the microblog data, and constructs a theme corpus consisting of four parts: high-frequency words of "big V" users, high-frequency words of grassroots users, the high-frequency word ranking of "big V" users, and the high-frequency word ranking of grassroots users. [Result/conclusion] The corpus supports viewing, searching, adding, and high-frequency word ranking, and can be queried for the theme high-frequency words of the corresponding "big V" and grassroots users.
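The four-part corpus described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the noise patterns (URLs, @mentions, #topic# tags), the stopword handling, and the assumption that posts are already segmented into whitespace-separated words are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical noise patterns to filter out: URLs, @mentions, #topic# tags.
NOISE = re.compile(r"https?://\S+|@\S+|#[^#]*#")


def preprocess(post):
    """Remove noise and split on whitespace.

    Assumes the post text has already been word-segmented, so that
    whitespace separates words (an assumption, not from the paper).
    """
    return NOISE.sub(" ", post).split()


def rank_high_frequency_words(posts, stopwords=frozenset(), top_n=10):
    """Return [(rank, word, count), ...] for the top_n most frequent words."""
    counts = Counter()
    for post in posts:
        counts.update(w for w in preprocess(post) if w not in stopwords)
    # Sort by descending count, then by word, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(ranked[:top_n], start=1)
    ]


def build_corpus(big_v_posts, grassroots_posts, stopwords=frozenset(), top_n=10):
    """Assemble the four parts named in the abstract into one dictionary."""
    big_v = rank_high_frequency_words(big_v_posts, stopwords, top_n)
    grassroots = rank_high_frequency_words(grassroots_posts, stopwords, top_n)
    return {
        "big_v_words": [word for _, word, _ in big_v],
        "grassroots_words": [word for _, word, _ in grassroots],
        "big_v_ranking": big_v,
        "grassroots_ranking": grassroots,
    }
```

In this sketch, calling `build_corpus` with two lists of post strings returns a dictionary whose four keys mirror the four corpus parts; a real pipeline would also need a Chinese word segmenter and persistent storage to back the viewing, searching, and adding functions.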
Source
Information Research (《情报探索》)
2016, No. 10, pp. 65-67 (3 pages)
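Funding
Guangdong Province Philosophy and Social Sciences "Twelfth Five-Year" Plan project "Research on Automatic Detection and Evolution Models of Microblog Public Events" (Project No. GD14YXW02)
One of the outcomes of the National Natural Science Foundation of China project "Research on Reverse Social Emotion Recognition and Evolution Analysis for Microblog Public Events" (Project No. 61572145)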
Keywords
microblog
corpus
high-frequency word