
An Improved News Text Clustering Algorithm Based on MinHash (cited by: 4)
Abstract The continuous development of information technology has brought rapid growth in the volume of news text on the Internet, which makes effective clustering of that text important. To meet this need, the paper proposes an improved DBSCAN clustering algorithm based on MinHash. Traditional vector space model text clustering suffers from high data dimensionality, high computational complexity, and heavy resource consumption; the algorithm addresses this by using MinHash to reduce the dimensionality of the feature-word sets of all texts, which reduces wasted resources. The Jaccard coefficient is then computed for every pair of data items in the resulting feature matrix. Each result is compared with the neighborhood radius Eps given to DBSCAN, and the algorithm checks whether the number of surrounding points whose value exceeds Eps is greater than or equal to MinPts, the minimum number of points needed to form a cluster. This determines whether a text is a core point and whether a cluster can be formed. Experimental results show that the method performs well on news text clustering and can effectively cluster the varied news text found on the Internet.
Author WANG An-jin (王安瑾) (School of Computer Science and Technology, Donghua University, Shanghai 200000, China)
Source Computer Technology and Development (《计算机技术与发展》), 2019, No. 2, pp. 39-42 (4 pages)
Funding National Natural Science Foundation of China (61472075)
Keywords MinHash; Jaccard coefficient; DBSCAN; text clustering
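
The pipeline outlined in the abstract (MinHash signatures, pairwise Jaccard estimates, then a DBSCAN-style core-point check against Eps and MinPts) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the seeded BLAKE2b hash family, the signature length `num_hashes`, the choice to treat Eps as a similarity threshold (neighbors are texts whose estimated Jaccard coefficient is at least Eps), and the choice to count a point as its own neighbor are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def _hash(word: str, seed: int) -> int:
    # Hypothetical hash family: one 64-bit BLAKE2b digest per (seed, word) pair.
    data = f"{seed}:{word}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(words, num_hashes=64):
    """Reduce a non-empty set of feature words to num_hashes minimum hash values."""
    return [min(_hash(w, seed) for w in words) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate the Jaccard coefficient as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def core_points(signatures, eps, min_pts):
    """Return indices of texts with at least min_pts neighbors.

    Assumption: a neighbor is a text whose estimated Jaccard coefficient
    is >= eps, and every text counts itself as one neighbor.
    """
    n = len(signatures)
    neighbor_counts = [1] * n
    for i, j in combinations(range(n), 2):
        if estimated_jaccard(signatures[i], signatures[j]) >= eps:
            neighbor_counts[i] += 1
            neighbor_counts[j] += 1
    return [i for i, count in enumerate(neighbor_counts) if count >= min_pts]


# Toy feature-word sets: three identical finance texts and one sports text.
docs = [
    {"stock", "market", "shares", "index"},
    {"stock", "market", "shares", "index"},
    {"stock", "market", "shares", "index"},
    {"football", "league", "goal", "coach"},
]
sigs = [minhash_signature(d) for d in docs]
core = core_points(sigs, eps=0.5, min_pts=3)
print(core)
```

In this toy input the three identical texts have identical signatures, so each has enough neighbors to be a core point, while the sports text does not. A full DBSCAN run would go on to expand clusters from the core points, which this sketch leaves out.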
