Abstract
Traditional text clustering suffers from excessively high data dimensionality and cannot capture deep semantics. To address these problems, this paper proposes a text clustering method based on feature dictionary construction and the BIRCH algorithm. The method builds a feature dictionary from an LDA topic model and semantic features, then clusters the texts with BIRCH. Experiments were conducted on web documents from Wikipedia, Baidu Baike (Baidu Encyclopedia), and Hudong Baike (Interactive Encyclopedia) covering four topics: attractions, animals, people, and countries. The results show that the feature dictionary, which combines topic keywords with semantic similarity, improves accuracy, recall, and F-measure over traditional methods. The method is broadly applicable to text mining, knowledge graphs, and natural language processing.
Authors
Yang Xiuzhang (杨秀璋)
Xia Huan (夏换)
Yu Xiaomin (于小民)
Wu Shuai (武帅)
Zhao Ziru (赵紫如)
Dou Yueqi (窦悦琪)
Affiliations: School of Information, Guizhou University of Finance and Economics, Guiyang, Guizhou 550025, China; Guizhou Key Laboratory of Economics System Simulation, Guizhou University of Finance and Economics
Source
Computer Era (《计算机时代》)
2019, No. 11, pp. 23-27, 31 (6 pages)
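Funding
Guizhou Provincial Science and Technology Program project "Application of a Knowledge Graph Construction Method Based on Multi-Source Geographic Data Fusion in Public Opinion Analysis: A Case Study of Guizhou Province" (黔科合基础[2019]1041)
Guizhou Provincial Science and Technology Program project "Simulation Study of the Time-Varying Behavior of Circular Underground Diaphragm Wall Structures" (黔科合基础[2019]1403)
Guizhou Provincial Department of Education Young Science and Technology Talent Growth Project "Research and Implementation of Entity and Attribute Alignment Methods" (黔教合KY字[2016]172)
Guizhou Provincial Department of Education Young Science and Technology Talent Growth Project "Research on Mesh Gateway Load Balancing in Wireless Campus Network Construction" (黔教合KY字[2016]178)
Guizhou Provincial Support Program for Top Science and Technology Talents in Regular Higher Education Institutions project "Big Data Analysis and Evaluation System for Remote Real-Time Monitoring of Directional Drilling Rigs" (黔教合KY字[2016]068)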