Abstract
Topic detection (TD) aims to automatically cluster texts that belong to the same topic from large-scale text data. Current research mainly focuses on analyzing and applying clustering frameworks such as hierarchical clustering and density clustering, and lacks in-depth exploration of how topic content is analyzed and represented. Based on an analysis of topic granularity, a new joint clustering model for theme and community is proposed. The model identifies content-related texts through theme consistency and then uses the named-entity communities in the texts to further subdivide the clusters of content-related texts, thereby avoiding the erroneous merging of similar topics in large-scale text that share the same kind of event but involve different people. Finally, the model is used to refine TD in both hierarchy and granularity, and a TD test is carried out on more than ten thousand documents covering 69 topics manually labeled by Sohu. The experimental results show that the clustering purity of the model exceeds 82%, indicating that the model has practical value.
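The abstract reports clustering purity as its evaluation measure. As a minimal sketch, assuming the standard definition of purity (each cluster is credited with the count of its most frequent gold topic, and the sum is divided by the number of documents), the metric can be computed as follows. The function name `clustering_purity` and the toy labels are hypothetical illustrations, not taken from the paper, and the paper may define purity differently.

```python
from collections import Counter


def clustering_purity(cluster_ids, topic_labels):
    """Return purity: the fraction of documents that carry the majority
    gold topic of the cluster they were assigned to.

    cluster_ids  -- predicted cluster id per document (hypothetical input)
    topic_labels -- manually assigned gold topic per document
    """
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count gold topics inside each predicted cluster.
    topics_by_cluster = {}
    for cluster, topic in zip(cluster_ids, topic_labels):
        topics_by_cluster.setdefault(cluster, Counter())[topic] += 1

    # Each cluster contributes the size of its largest gold-topic group.
    majority_total = sum(max(c.values()) for c in topics_by_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    # Toy data: six documents, three predicted clusters, three gold topics.
    clusters = [0, 0, 0, 1, 1, 2]
    topics = ["a", "a", "b", "b", "b", "c"]
    print(clustering_purity(clusters, topics))
```

Purity rewards clusters that contain a single topic, so it is sensitive to exactly the kind of erroneous merging the model is designed to avoid. It does not penalize splitting one topic across many clusters, so it is usually read alongside the number of clusters produced.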
Source
Command Information System and Technology (《指挥信息系统与技术》)
2017, No. 4, pp. 64-70 (7 pages)
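Funding
Supported by the National Natural Science Foundation of China (61373097, 61672368, 61672367, 61331011)
Jiangsu Province Science and Technology Program (SBK2015022101)
Ministry of Education-China Mobile Research Fund (MCM20150602)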
Keywords
topic detection (TD)
joint clustering model for theme and community
hierarchical clustering