Abstract
The affinity propagation (AP) algorithm has a high running-time cost, and its clustering quality depends on the preference (reference degree) values of the similarity matrix. To address these problems, this paper proposes a Spark-based improved AP algorithm (SIAP). First, the unweighted data set is weighted by a fusion of the ECC (edge clustering coefficient) and CD algorithms, and the preference values of the similarity matrix are set according to the weighting result to improve clustering accuracy. The improved AP algorithm is then parallelized on the Spark platform to reduce running time. The algorithm is applied to PPI (protein-protein interaction) data to identify protein complexes, and the F-measure clustering evaluation index is used to compare the results. The experimental results show that the algorithm achieves high clustering accuracy on different PPI networks and outperforms ClusterONE and other classical clustering algorithms, while also improving running efficiency and scaling well.
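The weighting and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It assumes the common definition of the edge clustering coefficient, ECC(u, v) = z / min(d_u - 1, d_v - 1), where z is the number of triangles containing the edge and d is node degree, and the standard F-measure as the harmonic mean of precision and recall. The function names are hypothetical. The abstract does not specify how ECC is fused with CD, how the preference values are derived from the weights, or how precision and recall are computed, so none of that is shown here.

```python
from collections import defaultdict


def edge_clustering_coefficients(edges):
    """Return {(u, v): ECC} for an undirected, unweighted edge list.

    Assumed definition: triangles through the edge divided by the
    maximum number of triangles the edge could belong to.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    weights = {}
    # Canonicalize each edge as a sorted pair so (u, v) and (v, u) coincide.
    for u, v in {tuple(sorted(e)) for e in edges if e[0] != e[1]}:
        triangles = len(neighbors[u] & neighbors[v])
        max_triangles = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        # An edge with a degree-1 endpoint cannot lie in any triangle.
        weights[(u, v)] = triangles / max_triangles if max_triangles > 0 else 0.0
    return weights


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy network: a triangle A-B-C with a pendant node D attached to C.
    toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
    for edge, weight in sorted(edge_clustering_coefficients(toy_edges).items()):
        print(edge, weight)
```

In this toy network the triangle edges receive a higher weight than the pendant edge, which is the kind of signal a weighted similarity matrix could pass on to AP.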
Authors
邓超
刘桂霞
孙立岩
王荣全
DENG Chao; LIU Guixia; SUN Liyan; WANG Rongquan (College of Software, Jilin University, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China)
Source
《哈尔滨工程大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2020, No. 11, pp. 1710-1714 (5 pages)
Journal of Harbin Engineering University
Funding
National Natural Science Foundation of China (61772226, 61373051, 61862056).