Abstract
With the spread of smart devices, demand for topic mining of text is growing across many domains. Topic modeling is the core of text topic mining. The LDA (latent Dirichlet allocation) generative model is a probabilistic model built on the Bayesian framework; it relies on semantic association and is well suited to extracting the latent topics of a text. This paper describes and analyzes in some depth the core techniques of the text clustering process: the LDA generative model, data sampling, and model evaluation. Topic discovery and clustering experiments were carried out on 2,794 learning publications from an online education platform, and a vocabulary of 3,800 terms was built. Topic clustering was solved in two steps, using the k-means algorithm and the union vector method (UVM). A general method for text mining experiments is proposed, together with an improved algorithm for computing text distance in hierarchical clustering. The experimental results show that the overall topic similarity of the platform's publications is fairly good, but the topics are so concentrated that the content of many publications is hard to tell apart, which makes it harder for users to locate a topic.
Authors
杨传春
张冰雪
李仁德
郭强
YANG Chuanchun; ZHANG Bingxue; LI Rende; GUO Qiang (Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China; MPA Education Center, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
《上海理工大学学报》
CAS
CSCD
PKU Core Journals (北大核心)
2019, No. 3, pp. 273-280, 306 (9 pages)
Journal of University of Shanghai for Science and Technology
Keywords
LDA model
generative model
topic discovery
hierarchical clustering
text mining