Abstract
Linear texts organize their content in order, so correctly dividing that ordered subject content in order to mine the information and knowledge hidden in the text is a problem worth studying. The traditional K-means clustering algorithm, when applied to linear texts, leads to a series of problems such as increased computational complexity, non-terminating iteration, and confused clustering results. To address these problems, we study the traditional K-means algorithm, improve its random initialization of center points, and propose a random uniform center-point initialization algorithm. The algorithm takes full account of the organizational structure of linear texts: after the first center point is chosen at random, the remaining center points are determined uniformly, which ensures a complete division of the text's subtopics. In addition, an equidistant-point classification method with constraint rules is adopted to classify text automatically within a limit on the number of iterations. Experimental results show that, when clustering linear texts, the algorithm effectively reduces the number of iterations and improves clustering accuracy, yielding better clustering results.
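The abstract does not give the algorithm's details, so the following Python sketch shows only one plausible reading of the two ideas it names. Everything specific here is an assumption made for illustration: the function names (`uniform_init_indices`, `assign`, `kmeans_linear`), the choice to draw the first center at random from the first stride of the sequence, the even spacing of the remaining centers by position, the rule that a point equidistant from two centers goes to the earlier center, and the `max_iter` cap.

```python
import random


def uniform_init_indices(n, k, rng=None):
    """Pick k center positions in a sequence of n segments.

    Assumption: the first center is random within the first stride,
    and the remaining centers follow at evenly spaced positions.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    rng = rng or random.Random()
    first = rng.randrange(n // k)
    return [first + (i * n) // k for i in range(k)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with its nearest center.

    Assumed constraint rule: a point equidistant from several centers
    is given to the lowest-index (earliest) one.
    """
    labels = []
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def kmeans_linear(points, k, max_iter=20, seed=None):
    """K-means over ordered segments, stopping after at most max_iter updates."""
    rng = random.Random(seed)
    idx = uniform_init_indices(len(points), k, rng)
    centers = [list(points[i]) for i in idx]
    labels = assign(points, centers)
    for _ in range(max_iter):
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(points, centers)
        if new_labels == labels:  # converged before the cap
            break
        labels = new_labels
    return labels, centers
```

Here `points` would be one feature vector per text segment, in reading order. The deterministic tie rule and the iteration cap together guarantee that the loop terminates, which is the behavior the abstract attributes to the constrained equidistant-point classification.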
Authors
文必龙
李菲
马强
WEN Bi-long; LI Fei; MA Qiang (School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China)
Source
《计算机技术与发展》
2018, Issue 9, pp. 53-58 (6 pages)
Computer Technology and Development
Funding
National Major Project of China (2016ZX05033-005-004)
Keywords
linear text
organizational structure
random uniform center-point selection
equidistant point classification
K-means algorithm