A Semi-Supervised Document Clustering Algorithm Combined with Active Learning (cited by: 30)

Efficiently Active Learning for Semi-Supervised Document Clustering
Abstract: Semi-supervised document clustering, which uses a small amount of supervised data to aid unsupervised document clustering, has in recent years become a topic of significant interest in the machine learning and data mining communities. Because obtaining large amounts of supervision is time-consuming and laborious, researchers have considered how to obtain a small amount of supervision that nonetheless improves clustering performance markedly. This paper proposes a semi-supervised document clustering algorithm combined with active learning. Pairwise constraints are introduced to guide the clustering process of DBSCAN, yielding a semi-supervised document clustering algorithm, Cons-DBSCAN. Based on a measure of the information carried by a constraint set and an analysis of the DBSCAN algorithm itself, a heuristic active learning algorithm is proposed that selects highly informative pairwise constraints and thus assists semi-supervised document clustering more efficiently. Experimental results show that the proposed algorithm clusters documents efficiently, that the pairwise constraints obtained through active learning significantly improve clustering performance, and that the algorithm outperforms two representative semi-supervised clustering algorithms that incorporate active learning.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, No. 6, pp. 1486-1499 (14 pages)
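Funding: National Natural Science Foundation of China (61105052, 61070232); Natural Science Foundation of Hunan Province (11JJ4051); General Project of the Education Department of Hunan Province (10C1262); Xiangtan University Doctoral Start-up Fund (10QDZ42); Open Fund of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (IIP2010-6); Northwest Normal University Young Teachers' Research Capacity Improvement Program, Backbone Project (NWNU-LKQN-10-1)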
Keywords: semi-supervised clustering; document clustering; active learning; pairwise constraint
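
The abstract says that pairwise constraints guide the clustering process of DBSCAN but gives no algorithmic detail. The following is a minimal sketch of one way such guidance could work, not the paper's Cons-DBSCAN: the function name `constrained_dbscan`, its parameters, and the two rules it applies (a cannot-link pair blocks cluster expansion; a must-link pair merges clusters afterwards when no cannot-link pair would be violated) are all assumptions made for illustration. The active selection of constraints is not sketched.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def constrained_dbscan(points, eps, min_pts, must_link=(), cannot_link=()):
    """DBSCAN-style clustering with pairwise constraints (illustrative assumption)."""
    cl = {frozenset(pair) for pair in cannot_link}
    labels = [None] * len(points)
    members = {}  # cluster id -> set of point indices
    next_id = 0

    def violates(j, cid):
        # True if adding point j to cluster cid breaks a cannot-link constraint.
        return any(frozenset((j, k)) in cl for k in members[cid])

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be absorbed as a border point
            continue
        cid, next_id = next_id, next_id + 1
        labels[i] = cid
        members[cid] = {i}
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] not in (None, NOISE) or violates(j, cid):
                continue
            unvisited = labels[j] is None
            labels[j] = cid
            members[cid].add(j)
            if unvisited:
                nb = region_query(points, j, eps)
                if len(nb) >= min_pts:  # j is a core point: keep expanding
                    queue.extend(nb)

    # Must-link pass: merge clusters (or absorb a noise point) when allowed.
    for a, b in must_link:
        la, lb = labels[a], labels[b]
        if la == lb:
            continue
        if la == NOISE:
            a, b, la, lb = b, a, lb, la  # ensure la is a real cluster
        moving = {b} if lb == NOISE else members[lb]
        if any(frozenset((x, y)) in cl for x in moving for y in members[la]):
            continue
        for x in moving:
            labels[x] = la
        members[la] |= moving
        if lb != NOISE:
            del members[lb]
    return labels
```

For documents, `points` would be feature vectors (for example TF-IDF), and a cosine-based distance would likely replace the Euclidean `math.dist` used here.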
