Determining the Optimal Number of Clusters for Semantics-Based Chinese Text Clustering
Abstract: This paper analyzes how the choice of the number of clusters affects clustering quality on large-sample data and examines several prevailing views on how clustering quality should be measured. Using the notion of text similarity, it studies the problem of finding the optimal number of semantic clusters for text and proposes CTBP, an algorithm that determines the optimal number of clusters during the clustering process. The main idea is to extract one term from each text vector in the text vector set, arrange the extracted terms in order of similarity, and partition the ordered sequence incrementally, layer by layer, to obtain the number of clusters that corresponds to the optimal partition. Multiple statistics can therefore be gathered in a single scan of the data, and the optimal solution is computed from them. Experimental results indicate that the algorithm is efficient and produces high-quality clusterings.
Author: Liu Jinling (刘金岭)
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2010, No. 9, pp. 2034-2036, 2100 (4 pages)
Keywords: text clustering; number of clusters; increment; partition; CTBP
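
The abstract describes CTBP only in outline, so the following is a minimal sketch of the general idea (order by similarity, split incrementally, keep the best partition), not the paper's algorithm. It assumes each document has already been reduced to a single similarity score, which is a simplification of "one term per text vector". The function name `best_cluster_count`, the rule of cutting at the largest gaps, and the use of the Calinski-Harabasz index as the quality measure are all choices made here for illustration; the abstract does not state which criterion CTBP uses.

```python
from typing import List, Tuple


def best_cluster_count(scores: List[float], k_max: int) -> Tuple[int, List[int]]:
    """Choose a cluster count for 1-D similarity scores (illustrative only).

    Returns (k, cuts), where cuts are the indices in the sorted scores
    at which a new cluster starts.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    xs = sorted(scores)
    n = len(xs)

    # A single pass builds prefix sums of x and x^2, so the squared error
    # of any contiguous segment can be read off without rescanning the data.
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(lo: int, hi: int) -> float:
        """Sum of squared deviations from the mean for xs[lo:hi]."""
        total = s1[hi] - s1[lo]
        return (s2[hi] - s2[lo]) - total * total / (hi - lo)

    # Candidate cut positions, largest gap between neighbours first
    # (assumed splitting rule, not taken from the paper).
    gap_order = sorted(range(1, n), key=lambda i: xs[i] - xs[i - 1], reverse=True)

    total_sse = sse(0, n)
    best_k, best_score, best_cuts = 1, float("-inf"), []
    cuts: List[int] = []
    for k in range(2, min(k_max, n - 1) + 1):
        cuts.append(gap_order[k - 2])  # add one more cut per layer
        bounds = [0] + sorted(cuts) + [n]
        within = sum(sse(a, b) for a, b in zip(bounds, bounds[1:]))
        between = total_sse - within
        # Calinski-Harabasz index: higher means tighter, better separated clusters.
        if within <= 0:
            score = float("inf")
        else:
            score = (between / (k - 1)) / (within / (n - k))
        if score > best_score:
            best_k, best_score, best_cuts = k, score, sorted(cuts)
    return best_k, best_cuts
```

Calling `best_cluster_count([0.1, 0.12, 0.11, 0.5, 0.52, 0.51, 0.9, 0.91, 0.92], 5)` evaluates partitions with 2 to 5 clusters and reports the one with the highest index. Real text vectors are high-dimensional, so a practical version would need a defined ordering or a different splitting rule.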