Abstract
Network data are ubiquitous in real-world applications for representing complex relationships among objects, e.g., social networks, citation networks, and web networks. However, due to the large scale and high-dimensional sparse representation of network datasets, it is hard to apply off-the-shelf machine learning methods to them directly. Network representation learning (NRL) can generate succinct node representations for large-scale networks and serves as a bridge between machine learning methods and network data; it has attracted great research interest from both academia and industry. Despite the wide adoption of NRL algorithms, the setting of their hyperparameters remains a key factor in the success of their applications, as hyperparameters can influence the algorithms' performance to a great extent. How to generate a task-aware set of hyperparameters for different NRL algorithms so as to obtain their best performance, compare their performance fairly, and select the most suitable NRL algorithm for analyzing the network data are fundamental questions that must be answered before NRL algorithms are applied. In addition, hyperparameter tuning is time-consuming, and the massive scale of network datasets further complicates the problem by incurring a high memory footprint; how to tune the hyperparameters of NRL algorithms within given resource constraints, such as a time budget or a memory limit, is therefore another problem. To address these two problems, we propose an easy-to-use framework named JITNREv that compares NRL algorithms fairly within resource constraints based on hyperparameter tuning. The framework has four loosely coupled components and adopts a sample-test-optimize process in a closed loop: a hyperparameter sampler, an NRL algorithm manipulator, a performance evaluator, and a hyperparameter sampling-space optimizer. All components interact with one another only through data flow. We use a divide-and-diverge sampling method based on Latin Hypercube Sampling to sample sets of hyperparameters, and trim the sample space around the previous best configuration according to the assumption that “around the point with the best performance in the sample set, we are more likely to find other points with similar or better performance”. The massive scale of network data also poses great challenges for hyperparameter tuning, since the computational cost of NRL algorithms grows in proportion to the network scale, so we use a graph coarsening model to reduce the data size while preserving graph structural information. JITNREv can therefore easily meet the resource constraints set by users. The framework also integrates representative algorithms, common evaluation datasets, widely used evaluation metrics, and data analysis applications for ease of use. Extensive experiments demonstrate that JITNREv can stably improve the performance of general NRL algorithms by hyperparameter tuning alone, thus enabling fair comparison of NRL algorithms at their best performance. For example, for the node classification task with GCN, JITNREv increases accuracy by up to 31% compared with the default hyperparameter settings.
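The abstract does not give implementation details, but the Latin-Hypercube-based sampling step it describes can be illustrated with a minimal Python sketch. The hyperparameter names and ranges below are hypothetical examples, not taken from the paper:

```python
import numpy as np

def latin_hypercube_sample(bounds, n_samples, rng=None):
    """Latin Hypercube Sampling: split each hyperparameter range into
    n_samples equal strata, draw one value per stratum, and permute the
    strata independently per dimension so samples spread evenly."""
    rng = rng or np.random.default_rng()
    names = list(bounds)
    dim = len(names)
    # One uniform draw inside each of the n_samples strata, per dimension.
    u = (rng.random((n_samples, dim)) + np.arange(n_samples)[:, None]) / n_samples
    for j in range(dim):
        u[:, j] = rng.permutation(u[:, j])  # decouple strata across dimensions
    lows = np.array([bounds[n][0] for n in names])
    highs = np.array([bounds[n][1] for n in names])
    points = lows + u * (highs - lows)  # scale unit cube to actual ranges
    return [dict(zip(names, p)) for p in points]

# Hypothetical hyperparameter space for an NRL algorithm.
space = {"learning_rate": (1e-4, 1e-1), "walk_length": (10, 80)}
configs = latin_hypercube_sample(space, n_samples=8)
```

Integer-valued hyperparameters would additionally need rounding; the sketch keeps all dimensions continuous for brevity.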
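The trimming heuristic and the closed-loop sample-test-optimize process might look like the following sketch, reusing latin_hypercube_sample from above. The shrink factor and the evaluate callback are placeholders, not the paper's actual choices, and a real run would also enforce the time and memory budgets the framework supports:

```python
def trim_space(bounds, best_cfg, shrink=0.5):
    """Shrink every hyperparameter range around the best configuration
    seen so far, clipped to the original bounds, following the assumption
    that better points are likely to lie near the current best."""
    trimmed = {}
    for name, (lo, hi) in bounds.items():
        half = (hi - lo) * shrink / 2.0
        center = best_cfg[name]
        trimmed[name] = (max(lo, center - half), min(hi, center + half))
    return trimmed

def tune(bounds, evaluate, n_rounds=5, n_samples=8):
    """Closed loop: sample a batch, evaluate each configuration, trim the
    space around the best one, and repeat. `evaluate` maps a configuration
    to a performance score (higher is better)."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_rounds):
        for cfg in latin_hypercube_sample(bounds, n_samples):
            score = evaluate(cfg)
            if score > best_score:
                best_cfg, best_score = cfg, score
        bounds = trim_space(bounds, best_cfg)
    return best_cfg, best_score
```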
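The abstract names graph coarsening but not a specific scheme; a common approach is greedy edge matching, where each node is merged with an unmatched neighbor into a supernode, roughly halving the graph per level while preserving coarse connectivity. A sketch under that assumption, using networkx:

```python
import networkx as nx

def coarsen_once(G):
    """One coarsening level via greedy matching: merge each node with an
    unmatched neighbor into a supernode, keeping edges between distinct
    supernodes so the coarse graph retains the original structure."""
    matched, merge_into = set(), {}
    for u in G.nodes():
        if u in matched:
            continue
        matched.add(u)
        merge_into[u] = u
        partner = next((v for v in G.neighbors(u) if v not in matched), None)
        if partner is not None:
            matched.add(partner)
            merge_into[partner] = u
    H = nx.Graph()
    H.add_nodes_from(set(merge_into.values()))
    for a, b in G.edges():
        ca, cb = merge_into[a], merge_into[b]
        if ca != cb:
            H.add_edge(ca, cb)
    return H, merge_into
```

Presumably, applying coarsen_once repeatedly until the graph fits the memory budget would let hyperparameters be tuned on the smaller graph before the chosen configuration is applied to the original network.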
Authors
GUO Meng-Ying; SUN Zhen-Yu; ZHU Yu-Qing; BAO Yun-Gang
(Center for Advanced Computer Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; Beijing National Research Center of Information Science and Technology (Tsinghua University), Beijing 100084; National Engineering Laboratory of Big Data System Software, Beijing 100084)
Source
Chinese Journal of Computers (《计算机学报》), 2022, No. 5, pp. 897-917 (21 pages)
Indexed in: EI, CAS, CSCD, Peking University Core Journal List (北大核心)
Funding
Supported by the National Key Research and Development Program of China (2016YFB1000201) and the National Natural Science Foundation of China (61420106013).
Keywords
network representation learning
network embedding
graph convolutional network
automated machine learning
hyperparameter tuning