Abstract
Big data has become a research hotspot in computer science in recent years, and clustering can address big data problems in areas such as data mining, machine learning, and text processing. The traditional DBSCAN algorithm requires its parameters to be set manually and is too slow for big data applications. To address these problems, this paper proposes an optimized DBSCAN algorithm. A KD tree is used to speed up the search for neighborhood objects, which significantly reduces the running time of the algorithm. The density threshold parameter (Minpts) is made adaptive by computing the mathematical expectation over all neighborhood objects. A text clustering workflow is then designed in which the SD-TF-IDF algorithm optimizes the weights of feature terms before the texts are clustered. Finally, the method is applied to the mining and analysis of big data from computer experiment texts in colleges and universities, with good results.
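The two DBSCAN changes named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact formula, so the code assumes that the "mathematical expectation" is the mean ε-neighborhood size over all points, rounded to an integer. The function names (`build_kdtree`, `radius_query`, `adaptive_minpts`, `dbscan`) are hypothetical, and the neighborhood radius `eps` is still supplied by the caller.

```python
import math

def build_kdtree(points, indices=None, depth=0):
    """Build a KD tree; each node is (point_index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2                # median point splits the space
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))

def radius_query(node, points, target, eps, found):
    """Append to `found` the indices of all points within eps of target."""
    if node is None:
        return
    idx, axis, left, right = node
    if math.dist(points[idx], target) <= eps:
        found.append(idx)
    diff = target[axis] - points[idx][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    radius_query(near, points, target, eps, found)
    if abs(diff) <= eps:                   # far side can only matter if the
        radius_query(far, points, target, eps, found)  # split plane is within eps

def neighborhoods(points, eps):
    """Eps-neighborhood (including the point itself) of every point."""
    tree = build_kdtree(points)
    result = []
    for p in points:
        found = []
        radius_query(tree, points, p, eps, found)
        result.append(found)
    return result

def adaptive_minpts(neigh):
    """ASSUMPTION: MinPts = rounded mean neighborhood size over all points."""
    return max(1, round(sum(len(n) for n in neigh) / len(neigh)))

def dbscan(points, eps):
    """Standard DBSCAN expansion using the KD-tree neighborhoods above."""
    neigh = neighborhoods(points, eps)
    min_pts = adaptive_minpts(neigh)
    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neigh[j]) >= min_pts:   # only core points keep expanding
                queue.extend(neigh[j])
    return labels, min_pts
```

Calling `dbscan(points, eps)` on a list of coordinate tuples returns one label per point (with `-1` for noise) together with the derived threshold. Because the threshold here is a mean, points in sparser-than-average regions are never core points; whether the paper's expectation behaves the same way cannot be determined from the abstract alone.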
Source
Computer Science and Application (《计算机科学与应用》)
2020, No. 5, pp. 906-913 (8 pages)
Funding
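University Industry-University-Research Innovation Fund of the Science and Technology Development Center, Ministry of Education: New Generation Information Technology Innovation Project (2018A01015, 2018A02027); National Natural Science Foundation of China (61871475, 61471133).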