Abstract
[Objective] This paper studies feature dimensionality reduction and clustering algorithms for news text, with the aim of further improving the efficiency and accuracy of hot topic discovery. [Methods] Building on the traditional TF-IDF feature weighting method, four weighting factors (symbol, part of speech, position, and length) are introduced to achieve multi-factor feature selection. An Ameliorated Fruit Fly Optimization Algorithm (AFOA) is constructed by modifying four aspects: encoding scheme, fitness function, adaptive step length, and population fitness variance. AFOA is used to select the initial cluster centers for K-means, and the optimized K-means is then used for hot topic discovery. Multi-factor feature selection is used to identify hot topics, and TOPSIS is used to rank them. [Results] Experiments show that multi-factor feature selection and the AFOA/K-means algorithm each significantly improve clustering performance, which supports the overall effectiveness of the proposed method. [Limitations] The method applies only to Chinese news text. [Conclusions] The proposed method offers a new approach for research on hot topic discovery in Chinese news.
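The abstract names TOPSIS as the ranking step but does not list the criteria or weights used. As a rough illustration only, the following minimal sketch shows standard TOPSIS on a small decision matrix. The function name `topsis`, the two criteria (report count and comment count), the weights, and the topic values are all hypothetical and are not taken from the paper.

```python
import math

def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1], one per row of `matrix`.

    matrix  : list of rows, one row per alternative (here: one per topic)
    weights : one weight per criterion (column)
    benefit : one bool per criterion, True if larger values are better
    """
    n_cols = len(weights)
    # Vector-normalise each column; fall back to 1.0 for an all-zero column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)]
                for row in matrix]
    # Ideal and anti-ideal points, taken column by column.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_pos = math.sqrt(sum((v - i) ** 2 for v, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((v - a) ** 2 for v, a in zip(row, anti)))
        total = d_pos + d_neg
        # Closeness to the ideal point; 0.0 if all alternatives coincide.
        scores.append(d_neg / total if total else 0.0)
    return scores

# Hypothetical topics scored on (report count, comment count).
topics = ["topic_a", "topic_b", "topic_c"]
values = [[120, 30], [80, 45], [20, 5]]
scores = topsis(values, weights=[0.6, 0.4], benefit=[True, True])
ranking = sorted(zip(topics, scores), key=lambda pair: pair[1], reverse=True)
```

A higher closeness score means the topic lies nearer the ideal point, so `ranking` lists topics from hottest to least hot under these assumed criteria.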
Authors
温廷新
李洋子
孙静霜
Wen Tingxin; Li Yangzi; Sun Jingshuang (Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, China; College of Business Administration, Liaoning Technical University, Huludao 125105, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core Journals (北大核心)
2019, Issue 4, pp. 97-106 (10 pages)
Data Analysis and Knowledge Discovery
Funding
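This work is one of the research outputs of the Liaoning Provincial Social Science Planning Fund project "Research on the Evaluation Index System for New-Type Urbanization in Liaoning" (Project No. L14BTJ004).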