Border Distance Based Multi-Vector Document Clustering Method
Abstract: Document clustering is an important task in natural language processing, with applications in areas such as information retrieval and web mining. Its key issues are document representation and the clustering algorithm. Building on hierarchical clustering, this paper proposes a new border distance based hierarchical clustering algorithm. To improve the accuracy of inter-cluster distance computation, it takes the average of the distances between the documents at the borders of two clusters as the distance between that pair of clusters, thereby exploiting the clusters' border information. To account for the different contributions of terms with different parts of speech, documents are represented with a multi-vector model. Experiments on different corpora show that the proposed approach outperforms other widely used hierarchical clustering methods.
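The abstract gives no formal definitions, so the following is only a minimal sketch of how the two ideas might fit together, not the authors' implementation. It assumes that (a) each document holds one term vector per part of speech and that document distance is a weighted sum of per-vector cosine distances, and (b) the "border" of a cluster with respect to another cluster is its `k` members nearest to that other cluster. The names `doc_distance`, `border_distance`, `agglomerate`, the parameter `k`, and the weights are all hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def doc_distance(d1, d2, weights):
    """Assumed multi-vector distance: weighted sum over part-of-speech vectors."""
    return sum(w * cosine_distance(d1[pos], d2[pos]) for pos, w in weights.items())


def border_distance(a, b, dist, k=2):
    """Average distance between the (assumed) border members of two clusters.

    The border of `src` toward `other` is taken to be the k members of `src`
    with the smallest distance to any member of `other`.
    """
    def border(src, other):
        return sorted(src, key=lambda x: min(dist(x, y) for y in other))[:k]

    border_a, border_b = border(a, b), border(b, a)
    total = sum(dist(x, y) for x in border_a for y in border_b)
    return total / (len(border_a) * len(border_b))


def agglomerate(items, dist, n_clusters, k=2):
    """Bottom-up clustering that repeatedly merges the closest pair of clusters."""
    clusters = [[item] for item in items]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: border_distance(clusters[p[0]], clusters[p[1]], dist, k),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy corpus: one noun vector and one verb vector per document (made-up values).
docs = [
    {"noun": [1.0, 0.0, 0.0], "verb": [1.0, 0.0]},
    {"noun": [0.9, 0.1, 0.0], "verb": [1.0, 0.1]},
    {"noun": [0.0, 0.0, 1.0], "verb": [0.0, 1.0]},
    {"noun": [0.0, 0.1, 0.9], "verb": [0.1, 1.0]},
]
weights = {"noun": 0.7, "verb": 0.3}  # hypothetical part-of-speech weights


def index_dist(i, j):
    return doc_distance(docs[i], docs[j], weights)


result = agglomerate(range(len(docs)), index_dist, n_clusters=2)
print(result)
```

Clusters are stored as lists of document indices so that any distance function over indices can be plugged in. With `k=1` the border distance reduces to single linkage, and with `k` at least as large as both clusters it reduces to average linkage, which is one way to read the claim that border information sits between those two extremes.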
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2008, Issue 3, pp. 198-201 (4 pages)
Funding: the National High-Tech Research and Development Plan of China (863 Program) under Grant No. 2006AA01Z148; the Scientific Key Project of the Ministry of Education of China under Grant No. 207148
Keywords: distance computation; document representation; multi-vector; document clustering