Study on the Optimal Number of Clusters for Semantic-Based Chinese Text Clustering (基于语义的中文文本聚类最佳簇数研究)

Abstract: This paper analyzes how the choice of the number of clusters affects clustering results on large-sample data, and examines several prevailing views on indices for measuring clustering quality. Using the concept of text similarity, it studies the problem of finding the optimal number of clusters for text semantics and proposes CTBP, an algorithm that determines the optimal number of text clusters during the clustering process. The main idea is to extract one word from each text vector in the text vector set, arrange the results in order of similarity, and partition them incrementally, layer by layer, to obtain the number of clusters corresponding to the optimal partition. In this way, multiple statistics can be collected in a single scan of the data, from which the optimal solution is finally computed. Experimental results show that the algorithm achieves both high clustering quality and high efficiency.
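The abstract gives only the outline of CTBP, so the following is a minimal illustrative sketch of the general idea (order by similarity, split incrementally, pick the best partition), not the paper's algorithm. It assumes each text has already been reduced to a single similarity score, and it substitutes the Calinski-Harabasz ratio as the quality criterion; the function names, the `max_k` parameter, and the choice of criterion are all assumptions made for illustration.

```python
from itertools import accumulate


def segment_sse(prefix, prefix_sq, lo, hi):
    """Sum of squared deviations from the mean for sorted scores[lo:hi]."""
    n = hi - lo
    s = prefix[hi] - prefix[lo]
    sq = prefix_sq[hi] - prefix_sq[lo]
    return sq - s * s / n


def estimate_cluster_count(scores, max_k=10):
    """Hypothetical helper: estimate a cluster count from 1-D similarity scores.

    Assumption: one similarity score per text. This is NOT the CTBP
    algorithm from the paper, only a sketch of the same general idea.
    """
    xs = sorted(scores)
    n = len(xs)
    if n < 3:
        return 1

    # One pass over the sorted data yields the prefix statistics that
    # every later partition evaluation reuses.
    prefix = [0.0] + list(accumulate(xs))
    prefix_sq = [0.0] + list(accumulate(x * x for x in xs))
    total = segment_sse(prefix, prefix_sq, 0, n)

    # Candidate cut positions, widest gap between neighbours first.
    cuts_by_gap = sorted(range(1, n), key=lambda i: xs[i] - xs[i - 1], reverse=True)

    best_k, best_ratio = 1, 0.0
    cuts = []
    for k in range(2, min(max_k, n - 1) + 1):
        cuts.append(cuts_by_gap[k - 2])  # add one more cut per layer
        bounds = [0] + sorted(cuts) + [n]
        within = sum(
            segment_sse(prefix, prefix_sq, a, b) for a, b in zip(bounds, bounds[1:])
        )
        between = total - within
        # Calinski-Harabasz ratio (assumed criterion, swapped in here).
        if within <= 1e-12:
            ratio = float("inf")
        else:
            ratio = (between / (k - 1)) / (within / (n - k))
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


if __name__ == "__main__":
    demo = [0.10, 0.12, 0.11, 0.50, 0.52, 0.51, 0.90, 0.91, 0.89]
    print(estimate_cluster_count(demo))
```

Because the prefix sums are built once, each candidate partition is scored without rescanning the data, which mirrors the single-scan property claimed in the abstract. The criterion is a heuristic and can over-split evenly spaced data, so a different index may suit real text collections better.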

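Author: Liu Jinling (刘金岭)

Affiliation: Department of Computer Science, Huaiyin Institute of Technology, Jiangsu, China

Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 9, pp. 2034-2036, 2100 (4 pages)

Keywords: text clustering; number of clusters; incremental division; CTBP

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]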
