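Abstract
The goal of process mining is to extract valuable information from the log data produced by software systems and to use it to configure or optimize implemented business processes. Meanwhile, the development of technologies such as big data and the Internet of Things has not only made business content increasingly complex but also accelerated the pace of business evolution. Against this background, it is necessary to divide the original log so that the event log can be analyzed more effectively through decomposition, thereby improving the quality of process mining. The aim of log division is to divide the original event log into multiple sub-logs, using different methods for different problems, in order to support subsequent process mining research. Model discovery is the most important application scenario in process mining, and the two major difficulties it faces are overly complex models and incorrect models. At present, these two difficulties are addressed by trace clustering and concept drift respectively, and both classes of methods essentially divide the original event log. This paper summarizes the two branches of trace clustering and concept drift and attempts to clarify the similarities and differences between them within log division. Next, a literature protocol is used to systematically count and analyze existing studies, revealing the development trends of the two research branches. Then, the main ideas of existing methods are organized: trace clustering is classified into distance-driven, model-driven, and hybrid clustering, and concept drift into single-type and composite-type. Finally, public datasets are used to test the strengths and weaknesses of different types of algorithms, and directions for future research are pointed out.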
Process mining aims to extract valuable information from event logs generated by software systems, which is often used to configure and optimize running business process models. Meanwhile, the development of information technologies (such as big data and the Internet of Things) not only makes process models more complex in structure and behavior, but also accelerates business evolution. Under these circumstances, it is necessary to analyze and simplify the original event log by dividing it into sub-logs that can be used more effectively, and it is more instructive for process mining to mine process models from the sub-logs than from the original log. Just as data division improves model performance by analyzing subsets, log division divides the original event log into multiple sub-logs, adopting different methods for different issues. The analysis of these sub-logs supports process mining research, especially in the process discovery scenario. Process mining has three application scenarios: process discovery, conformance checking and process enhancement. The most crucial task in the process mining domain is process discovery, defined as the construction of a reasonable process model from the original event log. However, the model mined from the original event log in the process discovery scenario tends to be too complex (a spaghetti-like model) and inaccurate (it neglects evolution). At present, the solutions to these two problems are trace clustering and concept drift: trace clustering can be considered a versatile solution for reducing the complexity of the mined models, while concept drift in process mining detects changes in event logs to improve the accuracy of process discovery. In our view, the two solutions rest on the same principle, because both improve the quality of the mined models by dividing the original log into multiple sub-logs. In this paper, we summarize the two key branches of log division, trace clustering and concept drift, and try to clarify the similarities and differences between them. We find that trace clustering considers only the similarity between different traces in the original log and ignores the timestamp attribute of the traces. In contrast, concept drift research focuses on finding the time points (or the locations of the traces in the original log) at which changes occur, and then divides the original log into sub-logs based on these time points. Moreover, we systematically summarize and analyze the development trend of the related studies through a literature protocol, and find that research on concept drift has grown faster over the past five years than research on trace clustering. More concretely, we classify the methods of trace clustering into three categories (i.e., distance-driven, model-driven and hybrid clustering) and the methods of concept drift into two types (i.e., single type and composite type). Furthermore, we use publicly available datasets to evaluate the advantages and disadvantages of different types of methods, and sketch some potential directions for future research.
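The contrast drawn above between the two kinds of log division can be illustrated with a minimal Python sketch. It is not the method of any surveyed paper: the toy log, the greedy Jaccard grouping, the 0.5 threshold, and the function names `split_by_similarity` and `split_by_time` are all assumptions made for illustration, and the change point is supplied by hand rather than detected.

```python
# Toy event log (made-up data): (case id, start timestamp, trace of activities).
LOG = [
    ("c1", 1, ("a", "b", "c")),
    ("c2", 2, ("a", "d", "e")),
    ("c3", 3, ("a", "b", "c")),
    ("c4", 4, ("a", "d", "e")),
    ("c5", 5, ("a", "c", "b")),
    ("c6", 6, ("a", "e", "d")),
]


def split_by_similarity(log, threshold=0.5):
    """Trace-clustering-style division (illustrative only).

    Greedily groups cases whose activity sets have Jaccard similarity
    >= threshold with a cluster's first member. Timestamps are ignored.
    """
    clusters = []  # list of (representative activity set, member cases)
    for case in log:
        acts = set(case[2])
        for rep, members in clusters:
            if len(acts & rep) / len(acts | rep) >= threshold:
                members.append(case)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((acts, [case]))
    return [members for _, members in clusters]


def split_by_time(log, change_points):
    """Concept-drift-style division (illustrative only).

    Cuts the time-ordered log at the given change points. Detecting the
    change points is the hard part and is NOT implemented here.
    """
    ordered = sorted(log, key=lambda case: case[1])
    sublogs = [[] for _ in range(len(change_points) + 1)]
    for case in ordered:
        # Index of the sub-log = number of change points already passed.
        idx = sum(case[1] >= cp for cp in change_points)
        sublogs[idx].append(case)
    return sublogs


if __name__ == "__main__":
    for group in split_by_similarity(LOG):
        print("similarity sub-log:", [case[0] for case in group])
    for group in split_by_time(LOG, change_points=[5]):
        print("time sub-log:", [case[0] for case in group])
```

On this toy log the similarity-based split interleaves cases from the whole timeline, whereas the time-based split keeps each sub-log contiguous in time, which is the difference between the two branches that the abstract points out.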
Authors
林雷蕾
闻立杰
钱忱
宗瓒
王建民
LIN Lei-Lei; WEN Li-Jie; QIAN Chen; ZONG Zan; WANG Jian-Min (Department of Software, Tsinghua University, Beijing 100084; Beijing Key Laboratory of Industrial Big Data System and Application, Beijing 100084)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (北大核心)
2022, Issue 9, pp. 1946-1968 (23 pages)
Chinese Journal of Computers
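Funding
Supported by the National Key Research and Development Program of China (2019YFB1704003)
and the National Natural Science Foundation of China (71690231, 62021002).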
Keywords
process mining (过程挖掘)
trace clustering (轨迹聚类)
concept drift (概念漂移)
business evolution (业务演化)