Border distance based multi-vector document clustering method (基于边界距离的多向量文本聚类方法)


Abstract: Document clustering is an important research topic in natural language processing and is applied mainly in areas such as information retrieval and web mining. Its two key issues are document representation and the clustering algorithm. Building on hierarchical clustering, this paper proposes a new border distance based hierarchical clustering algorithm. The method takes the average distance between the documents at the borders of two clusters as the distance between those clusters, which makes effective use of cluster border information and improves the accuracy of the inter-cluster distance calculation. To account for the different contributions that terms of different parts of speech make to a document, documents are represented with a multi-vector model. Experiments on several corpora show that the border distance based multi-vector clustering algorithm performs well and outperforms other widely used hierarchical clustering methods.
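The abstract does not say how border documents are selected or how the per-part-of-speech vectors are combined, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that the border of a cluster pair is the `k` closest cross-cluster document pairs, that each document is a dict of sparse term-weight vectors keyed by a part-of-speech label, and that the document distance is a weighted sum of per-vector cosine distances. The names `border_distance`, `doc_distance`, `cluster`, the parameter `k`, and the weights are all hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def doc_distance(a, b, weights):
    """Multi-vector distance: weighted sum over part-of-speech vectors.

    The weighting scheme is an assumption; the abstract only says that
    different parts of speech contribute differently.
    """
    return sum(w * cosine_distance(a.get(pos, {}), b.get(pos, {}))
               for pos, w in weights.items())


def border_distance(cluster_a, cluster_b, dist, k=2):
    """Assumed border rule: mean of the k smallest cross-cluster distances.

    k=1 reduces to single linkage; k=len(a)*len(b) to average linkage.
    """
    cross = sorted(dist[i][j] for i in cluster_a for j in cluster_b)
    nearest = cross[:k]
    return sum(nearest) / len(nearest)


def cluster(docs, weights, n_clusters, k=2):
    """Agglomerative clustering that merges the closest pair by border distance."""
    n = len(docs)
    dist = [[doc_distance(docs[i], docs[j], weights) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: border_distance(clusters[p[0]], clusters[p[1]], dist, k),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


if __name__ == "__main__":
    docs = [
        {"noun": {"cluster": 1, "text": 1}, "verb": {"group": 1}},
        {"noun": {"cluster": 1, "document": 1}, "verb": {"group": 1}},
        {"noun": {"goal": 1, "match": 1}, "verb": {"score": 1}},
        {"noun": {"goal": 1, "team": 1}, "verb": {"score": 1}},
    ]
    print(cluster(docs, {"noun": 0.7, "verb": 0.3}, n_clusters=2))
```

Under this assumed rule, `k` interpolates between single linkage and average linkage, which is one way to read "using border information" as a middle ground between relying on a single closest pair and averaging over all pairs.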

Authors: Cai Dongfeng (蔡东风), Wang Zhichao (王智超), Ji Duo (季铎), Zhang Guiping (张桂平)

Affiliation: Natural Language Processing Laboratory, Shenyang Institute of Aeronautical Engineering (沈阳航空工业学院)

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 3, pp. 198-201 (4 pages)

Funding: National High-Tech Research and Development Plan of China (863 Program), Grant No. 2006AA01Z148; Scientific Key Project of the Ministry of Education of China, Grant No. 207148

Keywords: distance computation; document representation; multi-vector; document clustering

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]



