Abstract
As the volume of logs produced by the nodes of China's national high-performance computing environment (CNGrid) keeps growing, traditional manual analysis of abnormal logs can no longer meet day-to-day needs. This study proposes a definition of an abnormal log traffic pattern: the ordered arrangement of the log types that occur on the same node within the same time slice represents one log traffic pattern. Based on this definition, a detection method is implemented to mine abnormal log traffic patterns automatically. The method starts from system logs and classifies them automatically by the text similarity of their content. The number of occurrences of each log type within a time slice is then taken as the input feature, and an anomaly detection method based on principal component analysis (PCA) is applied to this input, yielding a large number of abnormal log type sequences. Next, these sequences are grouped by hierarchical clustering with a distance metric based on the longest common subsequence, and the adaptive K-itemset algorithm is applied to the clustering results to obtain a representative sequence for each abnormal log traffic pattern. The method was used to analyze half a year of logs from CNGrid, split by time of day (morning, evening, and night), and produced the abnormal log traffic patterns of each period and the relationships among them. The method can also be extended to the system logs of other distributed systems.
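The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact distance formula or linkage criterion, so the normalisation `1 - LCS / max length`, the average linkage, the stopping threshold, and the single-letter log type labels below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Assumed normalisation: 0.0 for identical sequences, 1.0 for disjoint ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def cluster(sequences, threshold):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`. Returns clusters as lists of sequence indices.
    """
    clusters = [[i] for i in range(len(sequences))]

    def linkage(c1, c2):
        total = sum(lcs_distance(sequences[i], sequences[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        if linkage(clusters[p], clusters[q]) > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # safe because p < q
    return clusters


# Hypothetical abnormal log type sequences, one per anomalous time slice.
sequences = [
    ["A", "B", "C", "D"],
    ["A", "B", "D"],
    ["X", "Y", "Z"],
    ["X", "Z"],
]
groups = cluster(sequences, threshold=0.5)
```

With these made-up sequences, the first two share the subsequence A, B, D and the last two share X, Z, so a threshold of 0.5 keeps the two groups apart. A production version would more likely use a library routine for hierarchical clustering over a precomputed distance matrix.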
Authors
王晓东
赵一宁
肖海力
迟学斌
王小宁
WANG Xiao-Dong; ZHAO Yi-Ning; XIAO Hai-Li; CHI Xue-Bin; WANG Xiao-Ning (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source
《软件学报》
EI
CSCD
Peking University Core Journal
2020, Issue 10, pp. 3295-3308 (14 pages)
Journal of Software
Funding
National Key Research and Development Program of China (2018YFB0204002)
National Natural Science Foundation of China (61702477)
Keywords
abnormal log flow
principal component analysis
hierarchical clustering
longest common subsequence
adaptive K-itemset algorithm